In the output I am not getting the complete table from Excel - python-3.x

I just started using pandas. I wanted to import an Excel file with 31 rows and 11 columns, but in the output only some of the columns are displayed: the middle columns are represented by "...", and in the first column, 'EST', the first few elements are displayed with a trailing "00:00:00".
Code
import pandas as pd
df = pd.read_excel("C:\\Users\daryl\PycharmProjects\pandas\Book1.xlsx")
print(df)
Output
C:\Users\daryl\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/daryl/PycharmProjects/pandas/1. Introduction.py"
EST Temperature ... Events WindDirDegrees
0 2016-01-01 00:00:00 38 ... NaN 281
1 2016-02-01 00:00:00 36 ... NaN 275
2 2016-03-01 00:00:00 40 ... NaN 277
3 2016-04-01 00:00:00 25 ... NaN 345
4 2016-05-01 00:00:00 20 ... NaN 333
5 2016-06-01 00:00:00 33 ... NaN 259
6 2016-07-01 00:00:00 39 ... NaN 293
7 2016-08-01 00:00:00 39 ... NaN 79
8 2016-09-01 00:00:00 44 ... Rain 76
9 2016-10-01 00:00:00 50 ... Rain 109
10 2016-11-01 00:00:00 33 ... NaN 289
11 2016-12-01 00:00:00 35 ... NaN 235
12 1-13-2016 26 ... NaN 284
13 1-14-2016 30 ... NaN 266
14 1-15-2016 43 ... NaN 101
15 1-16-2016 47 ... Rain 340
16 1-17-2016 36 ... Fog-Snow 345
17 1-18-2016 25 ... Snow 293
18 1/19/2016 22 ... NaN 293
19 1-20-2016 32 ... NaN 302
20 1-21-2016 31 ... NaN 312
21 1-22-2016 26 ... Snow 34
22 1-23-2016 26 ... Fog-Snow 42
23 1-24-2016 28 ... Snow 327
24 1-25-2016 34 ... NaN 286
25 1-26-2016 43 ... NaN 244
26 1-27-2016 41 ... Rain 311
27 1-28-2016 37 ... NaN 234
28 1-29-2016 36 ... NaN 298
29 1-30-2016 34 ... NaN 257
30 1-31-2016 46 ... NaN 241
[31 rows x 11 columns]
Process finished with exit code 0

To answer your question about the display of only a few columns and "...":
All of the columns have been ingested correctly, but your screen / the console is not wide enough to print all of them at once. This is normal, expected behavior.
Pandas is not a spreadsheet visualization tool like Excel. Maybe someone can suggest a tool for visualizing dataframes in a spreadsheet format for Python; I think I've seen people do this in Spyder, but I don't use it myself.
If you want to make sure all of the columns are there, try list(df) or print(list(df)), which give you the column names. You can also widen pandas' display options, as in the sketch below.
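A minimal sketch using pandas' standard display options:
import pandas as pd

# Show every column and let pandas auto-detect the terminal width
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

df = pd.read_excel(r"C:\Users\daryl\PycharmProjects\pandas\Book1.xlsx")
print(df)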
To answer your question about the EST format:
It looks like you have some data cleaning to do. This is typical work in data science. I haven't worked much with dates/datetimes yet, but here is what I see:
The first few items are formatted YYYY-MM-DD and carry a timestamp as well, likely HH:MM:SS.
On index row 18, the date uses / instead of - as the separator.
The remaining rows are formatted M-DD-YYYY.
There's an option documented for read_csv (and also accepted by read_excel) that may take care of those automatically: parse_dates. If you turn that option on, e.g. pd.read_excel('file location', parse_dates=['EST']), it enables the date parser for the EST column and may solve your problem.
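A minimal sketch of both routes, reusing the path from the question (whether the mixed M-DD-YYYY / YYYY-MM-DD values parse cleanly depends on your pandas version, so treat this as a starting point):
import pandas as pd

# Route 1: ask read_excel to parse the column while reading
df = pd.read_excel(r"C:\Users\daryl\PycharmProjects\pandas\Book1.xlsx",
                   parse_dates=['EST'])

# Route 2: convert after the fact
df['EST'] = pd.to_datetime(df['EST'])
print(df['EST'].head())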
Hope this helps! This is my first answer; anyone who sees it should feel free to edit and improve it.

Related

Append a column from one df to another based on the date column on both dfs - pandas

I have two dfs as shown below.
df1:
Date t_factor
2020-02-01 5
2020-02-02 23
2020-02-03 14
2020-02-04 23
2020-02-05 23
2020-02-06 23
2020-02-07 30
2020-02-08 29
2020-02-09 100
2020-02-10 38
2020-02-11 38
2020-02-12 38
2020-02-13 70
2020-02-14 70
2020-02-15 38
2020-02-16 38
2020-02-17 70
2020-02-18 70
2020-02-19 38
2020-02-20 38
2020-02-21 70
2020-02-22 70
2020-02-23 38
2020-02-24 38
2020-02-25 70
2020-02-26 70
2020-02-27 70
df2:
From to plan score
2020-02-03 2020-02-05 start 20
2020-02-07 2020-02-08 foundation 25
2020-02-10 2020-02-12 learn 10
2020-02-14 2020-02-16 practice 20
2020-02-15 2020-02-21 exam 30
2020-02-20 2020-02-23 test 10
From the above, I would like to append the plan column to df1 based on the From and to date values in df2 and the Date value in df1.
Expected output:
output_df
Date t_factor plan
2020-02-01 5 NaN
2020-02-02 23 NaN
2020-02-03 14 start
2020-02-04 23 start
2020-02-05 23 start
2020-02-06 23 NaN
2020-02-07 30 foundation
2020-02-08 29 foundation
2020-02-09 100 NaN
2020-02-10 38 learn
2020-02-11 38 learn
2020-02-12 38 learn
2020-02-13 70 NaN
2020-02-14 70 practice
2020-02-15 38 NaN
2020-02-16 38 NaN
2020-02-17 70 exam
2020-02-18 70 exam
2020-02-19 38 exam
2020-02-20 38 NaN
2020-02-21 70 NaN
2020-02-22 70 test
2020-02-23 38 test
2020-02-24 38 NaN
2020-02-25 70 NaN
2020-02-26 70 NaN
2020-02-27 70 NaN
Note:
If there is any overlapping date, then keep plan as NaN for that date.
Example:
2020-02-14 to 2020-02-16 plan is practice.
And 2020-02-15 to 2020-02-21 plan is exam.
So there is overlap is on 2020-02-15 and 2020-02-16.
Hence plan should be NaN for that date range.
I would like to implement a function like the one shown below:
def f(df1, df2):
    return output_df
Use this solution if the From and to dates in dataframe df2 overlap and we need to choose the values from column plan with respect to the earliest date possible:
d1 = df1.sort_values('Date')
d2 = df2.sort_values('From')
df = pd.merge_asof(d1, d2[['From', 'plan']], left_on='Date', right_on='From')
df = pd.merge_asof(df, d2[['to', 'plan']], left_on='Date', right_on='to',
                   direction='forward', suffixes=['', '_r']).drop(['From', 'to'], axis=1)
df['plan'] = df['plan'].mask(df['plan'].ne(df.pop('plan_r')))
Details:
Use pd.merge_asof to perform an asof merge of the dataframes d1 and d2 on the corresponding columns Date and From with the default direction='backward', creating the merged dataframe df. Then use pd.merge_asof again to asof-merge df and d2 on the corresponding columns Date and to with direction='forward'.
print(df)
Date t_factor plan plan_r
0 2020-02-01 5 NaN start
1 2020-02-02 23 NaN start
2 2020-02-03 14 start start
3 2020-02-04 23 start start
4 2020-02-05 23 start start
5 2020-02-06 23 start foundation
6 2020-02-07 30 foundation foundation
7 2020-02-08 29 foundation foundation
8 2020-02-09 100 foundation learn
9 2020-02-10 38 learn learn
10 2020-02-11 38 learn learn
11 2020-02-12 38 learn learn
12 2020-02-13 70 learn practice
13 2020-02-14 70 practice practice
14 2020-02-15 38 exam practice
15 2020-02-16 38 exam practice
16 2020-02-17 70 exam exam
17 2020-02-18 70 exam exam
18 2020-02-19 38 exam exam
19 2020-02-20 38 test exam
20 2020-02-21 70 test exam
21 2020-02-22 70 test test
22 2020-02-23 38 test test
23 2020-02-24 38 test NaN
24 2020-02-25 70 test NaN
25 2020-02-26 70 test NaN
26 2020-02-27 70 test NaN
Use Series.ne + Series.mask to mask the values in column plan where plan is not equal to plan_r.
print(df)
Date t_factor plan
0 2020-02-01 5 NaN
1 2020-02-02 23 NaN
2 2020-02-03 14 start
3 2020-02-04 23 start
4 2020-02-05 23 start
5 2020-02-06 23 NaN
6 2020-02-07 30 foundation
7 2020-02-08 29 foundation
8 2020-02-09 100 NaN
9 2020-02-10 38 learn
10 2020-02-11 38 learn
11 2020-02-12 38 learn
12 2020-02-13 70 NaN
13 2020-02-14 70 practice
14 2020-02-15 38 NaN
15 2020-02-16 38 NaN
16 2020-02-17 70 exam
17 2020-02-18 70 exam
18 2020-02-19 38 exam
19 2020-02-20 38 NaN
20 2020-02-21 70 NaN
21 2020-02-22 70 test
22 2020-02-23 38 test
23 2020-02-24 38 NaN
24 2020-02-25 70 NaN
25 2020-02-26 70 NaN
26 2020-02-27 70 NaN
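Wrapped in the function shape the question asks for, the merge_asof solution above becomes the following minimal sketch (the name append_plan is mine):
def append_plan(df1, df2):
    # merge_asof requires both frames to be sorted on the join keys
    d1 = df1.sort_values('Date')
    d2 = df2.sort_values('From')
    # Backward asof-merge on the interval start, forward on the interval end
    out = pd.merge_asof(d1, d2[['From', 'plan']], left_on='Date', right_on='From')
    out = pd.merge_asof(out, d2[['to', 'plan']], left_on='Date', right_on='to',
                        direction='forward', suffixes=['', '_r']).drop(['From', 'to'], axis=1)
    # A date lies inside an interval only when both merges agree on the plan
    out['plan'] = out['plan'].mask(out['plan'].ne(out.pop('plan_r')))
    return out

output_df = append_plan(df1, df2)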
Using pd.to_datetime, convert the date-like columns to pandas datetime series:
df1['Date'] = pd.to_datetime(df1['Date'])
df2[['From', 'to']] = df2[['From', 'to']].apply(pd.to_datetime)
Create a pd.IntervalIndex from the columns From and to of df2, then use Series.map on the column Date of df1 to map it to the column plan of df2 (after setting idx as the index):
idx = pd.IntervalIndex.from_arrays(df2['From'], df2['to'], closed='both')
df1['plan'] = df1['Date'].map(df2.set_index(idx)['plan'])
Result:
Date t_factor plan
0 2020-02-01 5 NaN
1 2020-02-02 23 NaN
2 2020-02-03 14 start
3 2020-02-04 23 start
4 2020-02-05 23 start
5 2020-02-06 23 NaN
6 2020-02-07 30 foundation
7 2020-02-08 29 foundation
8 2020-02-09 100 NaN
9 2020-02-10 38 learn
10 2020-02-11 38 learn
11 2020-02-12 38 learn
12 2020-02-13 70 NaN
13 2020-02-14 70 practice
14 2020-02-15 38 practice
15 2020-02-16 38 practice
16 2020-02-17 70 exam
17 2020-02-18 70 exam
18 2020-02-19 38 NaN
19 2020-02-20 38 test
20 2020-02-21 70 test
21 2020-02-22 70 test
22 2020-02-23 38 test
23 2020-02-24 38 NaN
24 2020-02-25 70 NaN
25 2020-02-26 70 NaN
26 2020-02-27 70 NaN

Making a Timeseries plot in Python, but want to skip a few months

I have timeseries data of ice thickness. The plot is only useful for the winter months, and there is no interest in seeing big gaps during the summer months. Would it be possible to skip the summer months (say, April to October) on the x-axis and have a smaller area with a different color, labeled Summer?
Let's take this data:
import numpy as np
import pandas as pd

n_samples = 20
index = pd.date_range(start='1/1/2018', periods=n_samples, freq='M')
values = np.random.randint(0, 100, size=(n_samples))
data = pd.Series(values, index=index)
print(data)
2018-01-31 58
2018-02-28 93
2018-03-31 15
2018-04-30 87
2018-05-31 51
2018-06-30 67
2018-07-31 22
2018-08-31 66
2018-09-30 55
2018-10-31 73
2018-11-30 70
2018-12-31 61
2019-01-31 95
2019-02-28 97
2019-03-31 31
2019-04-30 50
2019-05-31 75
2019-06-30 80
2019-07-31 84
2019-08-31 19
Freq: M, dtype: int64
You can filter out the data that is not in the range of those months: take the index of the Series, take the month, check whether it is in the range, and negate the result (with ~).
filtered1 = data[~data.index.month.isin(range(4,10))]
print(filtered1)
2018-01-31 58
2018-02-28 93
2018-03-31 15
2018-10-31 73
2018-11-30 70
2018-12-31 61
2019-01-31 95
2019-02-28 97
2019-03-31 31
If you plot that,
filtered1.plot()
the line is drawn straight across the summer gap, because the missing months are simply absent from the index. So you need to set the frequency, in this case monthly (M), which reinserts the skipped months as NaN and leaves a visible break:
filtered1.asfreq('M').plot()
Additionally, you could use filters like:
filtered2 = data[data.index.month.isin([1,2,3,11,12])]
filtered3 = data[~ data.index.month.isin([4,5,6,7,8,9,10])]
if you need to keep or filter specific months. A sketch of shading and labeling the skipped summer period follows.
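To address the "smaller area with a different color and labeled Summer" part of the question, here is a minimal sketch using matplotlib's axvspan (the dates, color, and label position are my own choices; note that this shades the gap rather than compressing the axis, which would need a broken-axis plot):
import matplotlib.pyplot as plt

monthly = filtered1.asfreq('M')
fig, ax = plt.subplots()
ax.plot(monthly.index, monthly.values)
# Shade the skipped summer period and put a label in the middle of it
ax.axvspan(pd.Timestamp('2018-04-01'), pd.Timestamp('2018-10-31'),
           color='orange', alpha=0.2)
ax.text(pd.Timestamp('2018-07-15'), monthly.max() / 2, 'Summer', ha='center')
plt.show()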

Fill NaN values from the previous value - pandas

I have the below data from an Excel sheet, and I want every NaN to be filled from just its previous value, even if there is more than one NaN in a row. I tried the ffill() method, but it didn't seem to solve the problem: it looked like it took the very first value before the NaNs in the column and populated that into all of the NaNs.
Could someone help, please?
My DataFrame:
import pandas as pd
df = pd.read_excel("Example-sheat.xlsx",sheet_name='Sheet1')
#df = df.fillna(method='ffill')
#df = df['AuthenticZTed domaTT controller'].ffill()
print(df)
My DataFrame output:
AuthenticZTed domaTT controller source KTvice naHR
0 ZTPGRKMIK1DC200.example.com TTv1614
1 TT1NDZ45DC202.example.com TTv1459
2 TT1NDZ45DC202.example.com TTv1495
3 NaN TTv1670
4 TT1NDZ45DC202.example.com TTv1048
5 TN1CQI02DC200.example.com TTv1001
6 DU2RDCRDC1DC204.example.com TTva082
7 NaN xxgb-gen
8 ZTPGRKMIK1DC200.example.com TTva038
9 DU2RDCRDC1DC204.example.com TTv0071
10 NaN ttv0032
11 KT1MUC02DUDC201.example.com TTv0073
12 NaN TTv0679
13 TN1SZZ67DC200.example.com TTv1180
14 TT1NDZ45DC202.example.com TTv1181
15 TT1BLR01APDC200.example.com TTv0859
16 TN1SZZ67DC200.example.com xxg2089
17 NaN TTv1846
18 ZTPGRKMIK1DC200.example.com TTvtp064
19 PR1CPQ01DC200.example.com TTv0950
20 PR1CPQ01DC200.example.com TTc7005
21 NaN TTv0678
22 KT1MUC02DUDC201.example.com TTv257032798
23 PR1CPQ01DC200.example.com xxg2016
24 NaN TTv0313
25 TT1BLR01APDC200.example.com TTc4901
26 NaN TTv0710
27 DU2RDCRDC1DC204.example.com xxg3008
28 NaN TTv1080
29 PR1CPQ01DC200.example.com xxg2022
30 NaN xxg2057
31 NaN TTv1522
32 TN1SZZ67DC200.example.com TTv258998881
33 PR1CPQ01DC200.example.com TTv259064418
34 ZTPGRKMIK1DC200.example.com TTv259129955
35 TT1BLR01APDC200.example.com xxg2034
36 NaN TTv259326564
37 TNHSZPBCD2DC200.example.com TTv259129952
38 KT1MUC02DUDC201.example.com TTv259195489
39 ZTPGRKMIK1DC200.example.com TTv0683
40 ZTPGRKMIK1DC200.example.com TTv0885
41 TT1BLR01APDC200.example.com dbexh
42 NaN TTvtp065
43 TN1PEK01APDC200.example.com TTvtp057
44 ZTPGRKMIK1DC200.example.com TTvtp007
45 NaN TTvtp063
46 TT1BLR01APDC200.example.com TTvtp032
47 KTphbgsa11dc201.example.com TTvtp046
48 NaN TTvtp062
49 PR1CPQ01DC200.example.com TTv0235
50 NaN TTv0485
51 TT1NDZ45DC202.example.com TTv0236
52 NaN TTv0486
53 PR1CPQ01DC200.example.com TTv0237
54 NaN TTv0487
55 TT1BLR01APDC200.example.com TTv0516
56 TN1CQI02DC200.example.com TTv1285
57 TN1PEK01APDC200.example.com TTv0440
58 NaN liv9007
59 HR1GDL28DC200.example.com TTv0445
60 NaN tuv006
61 FTGFTPTP34DC203.example.com TTv0477
62 NaN tuv002
63 TN1CQI02DC200.example.com TTv0534
64 TN1SZZ67DC200.example.com TTv0639
65 NaN TTv0825
66 NaN TTv1856
67 TT1BLR01APDC200.example.com TTva101
68 TN1SZZ67DC200.example.com TTv1306
69 KTphbgsa11dc201.example.com TTv1072
70 NaN webx02
71 KT1MUC02DUDC201.example.com TTv1310
72 PR1CPQ01DC200.example.com TTv1151
73 TN1CQI02DC200.example.com TTv1165
74 NaN tuv90
75 TN1SZZ67DC200.example.com TTv1065
76 KTphbgsa11dc201.example.com TTv1737
77 NaN ramn01
78 HR1GDL28DC200.example.com ramn02
79 NaN ptb001
80 HR1GDL28DC200.example.com ptn002
81 NaN ptn003
82 TN1SZZ67DC200.example.com TTs0057
83 PR1CPQ01DC200.example.com TTs0058
84 NaN TTs0058-duplicZTe
85 PR1CPQ01DC200.example.com xxg2080
86 KTphbgsa11dc204.example.com xxg2081
87 TN1PEK01APDC200.example.com xxg2082
88 NaN xxg3002
89 TN1SZZ67DC200.example.com xxg2084
90 NaN xxg3005
91 ZTPGRKMIK1DC200.example.com xxg2086
92 NaN xxg3007
93 KT1MUC02DUDC201.example.com xxg2098
94 NaN xxg3014
95 TN1PEK01APDC200.example.com xxg2026
96 NaN xxg2094
97 TN1PEK01APDC200.example.com livtp005
98 KT1MUC02DUDC201.example.com xxg2059
99 ZTPGRKMIK1DC200.example.com acc9102
100 NaN xxg2111
101 TN1CQI02DC200.example.com xxgtp009
Desired Output:
AuthenticZTed domaTT controller source KTvice naHR
0 ZTPGRKMIK1DC200.example.com TTv1614
1 TT1NDZ45DC202.example.com TTv1459
2 TT1NDZ45DC202.example.com TTv1495
3 TT1NDZ45DC202.example.com TTv1670 <---
4 TT1NDZ45DC202.example.com TTv1048
5 TN1CQI02DC200.example.com TTv1001
6 DU2RDCRDC1DC204.example.com TTva082
7 DU2RDCRDC1DC204.example.com xxgb-gen <---
1- You are already close to your solution; filling from a shift() combined with ffill() works:
df = df.fillna(df.shift()).ffill()
2- As Quang suggested in the comments, this also works:
df['AuthenticZTed domaTT controller'] = df['AuthenticZTed domaTT controller'].ffill()
3- Or you can build the per-column fill values with shift() and then ffill():
df = df.fillna({col: df[col].shift() for col in df}).ffill()
4- The other way around, you can define a cols variable if you have multiple columns and then loop through it:
cols = ['AuthenticZTed domaTT controller', 'source KTvice naHR']
for col in cols:
    df[col] = df[col].ffill()
print(df)
print(df)
OR, without the loop:
df.loc[:, cols] = df.loc[:, cols].ffill()
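For what it's worth, a tiny self-contained check (made-up data) showing that ffill() fills every NaN from the nearest preceding non-null value, not from the column's first value:
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b', np.nan, np.nan, 'c'])
print(s.ffill().tolist())
# ['a', 'a', 'b', 'b', 'b', 'c'] -- each NaN takes the closest value above it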

Pandas: how to drop the lowest 5th percentile for each indexed group?

I have the following issue with Python pandas (I am relatively new to it): I have a simple dataset with a date column and a corresponding column of values. I am able to sort this DataFrame by date and value by doing the following:
df = df.sort_values(['date', 'value'], ascending=False)
I obtain this:
date value
2019-11 100
2019-11 89
2019-11 87
2019-11 86
2019_11 45
2019_11 33
2019_11 24
2019_11 11
2019_11 8
2019_11 5
2019-10 100
2019-10 98
2019-10 96
2019-10 94
2019_10 94
2019_10 78
2019_10 74
2019_10 12
2019_10 3
2019_10 1
Now, what I would like to do is get rid of the lowest fifth percentile of the value column for EACH month (each group). I know that I should use a groupby method, and perhaps also a function:
df = df.sort_values(['date', 'value'], ascending=False).groupby('date', group_keys=False).apply(<???>)
The ??? is where I am struggling. I know how to drop the lowest 5th percentile from a sorted DataFrame as a WHOLE, for instance by doing:
df = df[df.value > df.value.quantile(.05)]
This was the subject of another post on Stack Overflow. I know that I can also use numpy to do this, and that it is much faster, but my issue is really how to apply that to EACH GROUP independently (each portion of the value column, grouped by month), not just to the whole DataFrame.
Any help would be greatly appreciated.
Thank you so very much,
Kind regards,
Berti
Use GroupBy.transform with a lambda function to get a Series of per-group thresholds the same size as the original DataFrame, so you can filter with boolean indexing:
df = df.sort_values(['date', 'value'], ascending=False)
q = df.groupby('date')['value'].transform(lambda x: x.quantile(.05))
df = df[df.value > q]
print(df)
date value
4 2019_11 45
5 2019_11 33
6 2019_11 24
7 2019_11 11
8 2019_11 8
14 2019_10 94
15 2019_10 78
16 2019_10 74
17 2019_10 12
18 2019_10 3
0 2019-11 100
1 2019-11 89
2 2019-11 87
10 2019-10 100
11 2019-10 98
12 2019-10 96
You could create your own function and apply it:
import numpy as np

def remove_bottom_5_pct(arr):
    thresh = np.percentile(arr, 5)
    return arr[arr > thresh]

df.groupby('date', sort=False)['value'].apply(remove_bottom_5_pct)
[out]
date
2019-11 0 100
1 89
2 87
3 86
4 45
5 33
6 24
7 11
8 8
2019-10 10 100
11 98
12 96
13 94
14 94
15 78
16 74
17 12
18 3
Name: value, dtype: int64
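If you want the surviving rows back as a flat DataFrame rather than a MultiIndexed Series, one possible follow-up (a sketch; droplevel keeps the original row labels, which can then index the original frame):
kept = df.groupby('date', sort=False)['value'].apply(remove_bottom_5_pct)
kept = kept.droplevel(0)      # drop the 'date' level, keep the original row labels
result = df.loc[kept.index]   # rows that survived the 5th-percentile cut
print(result)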

How to calculate Volume Weighted Average Price (VWAP) using a pandas dataframe with ask and bid price?

How do I create another column called vwap which calculates the VWAP if my table is as shown below?
time bid_size bid ask ask_size trade trade_size phase
0 2019-01-07 07:45:01.064515 495 152.52 152.54 19 NaN NaN OPEN
1 2019-01-07 07:45:01.110072 31 152.53 152.54 19 NaN NaN OPEN
2 2019-01-07 07:45:01.116596 32 152.53 152.54 19 NaN NaN OPEN
3 2019-01-07 07:45:01.116860 32 152.53 152.54 21 NaN NaN OPEN
4 2019-01-07 07:45:01.116905 34 152.53 152.54 21 NaN NaN OPEN
5 2019-01-07 07:45:01.116982 34 152.53 152.54 31 NaN NaN OPEN
6 2019-01-07 07:45:01.147901 38 152.53 152.54 31 NaN NaN OPEN
7 2019-01-07 07:45:01.189971 38 152.53 152.54 31 ask 15.0 OPEN
8 2019-01-07 07:45:01.189971 38 152.53 152.54 16 NaN NaN OPEN
9 2019-01-07 07:45:01.190766 37 152.53 152.54 16 NaN NaN OPEN
10 2019-01-07 07:45:01.190856 37 152.53 152.54 15 NaN NaN OPEN
11 2019-01-07 07:45:01.190856 37 152.53 152.54 16 ask 1.0 OPEN
12 2019-01-07 07:45:01.193938 37 152.53 152.55 108 NaN NaN OPEN
13 2019-01-07 07:45:01.193938 37 152.53 152.54 15 ask 15.0 OPEN
14 2019-01-07 07:45:01.194326 2 152.54 152.55 108 NaN NaN OPEN
15 2019-01-07 07:45:01.194453 2 152.54 152.55 97 NaN NaN OPEN
16 2019-01-07 07:45:01.194479 6 152.54 152.55 97 NaN NaN OPEN
17 2019-01-07 07:45:01.194507 19 152.54 152.55 97 NaN NaN OPEN
18 2019-01-07 07:45:01.194532 19 152.54 152.55 77 NaN NaN OPEN
19 2019-01-07 07:45:01.194598 19 152.54 152.55 79 NaN NaN OPEN
Sorry, the table is not clear, but the second-rightmost column is trade_size; to its left is trade, which shows the side of the trade (bid or ask). If both trade_size and trade are NaN, it indicates that no trade occurred at that timestamp.
If df['trade'] == "ask", the trade price is the price in column 'ask', and if df['trade'] == "bid", the trade price is the price in column 'bid'. Since there are two prices, may I ask how I can calculate the VWAP, df['vwap']?
My idea is to use np.cumsum().
You can use np.where to take the price from the correct column (bid or ask) depending on the value in the trade column. Note that this gives the bid price when no trade occurs, but because that price is then multiplied by a NaN trade size, it won't matter. I also forward-filled the VWAP.
import numpy as np

volume = df['trade_size']
price = np.where(df['trade'].eq('ask'), df['ask'], df['bid'])
df = df.assign(VWAP=((volume * price).cumsum() / volume.cumsum()).ffill())
>>> df
time bid_size bid ask ask_size trade trade_size phase VWAP
0 2019-01-07 07:45:01.064515 495 152.52 152.54 19 NaN NaN OPEN NaN
1 2019-01-07 07:45:01.110072 31 152.53 152.54 19 NaN NaN OPEN NaN
2 2019-01-07 07:45:01.116596 32 152.53 152.54 19 NaN NaN OPEN NaN
3 2019-01-07 07:45:01.116860 32 152.53 152.54 21 NaN NaN OPEN NaN
4 2019-01-07 07:45:01.116905 34 152.53 152.54 21 NaN NaN OPEN NaN
5 2019-01-07 07:45:01.116982 34 152.53 152.54 31 NaN NaN OPEN NaN
6 2019-01-07 07:45:01.147901 38 152.53 152.54 31 NaN NaN OPEN NaN
7 2019-01-07 07:45:01.189971 38 152.53 152.54 31 ask 15.0 OPEN 152.54
8 2019-01-07 07:45:01.189971 38 152.53 152.54 16 NaN NaN OPEN 152.54
9 2019-01-07 07:45:01.190766 37 152.53 152.54 16 NaN NaN OPEN 152.54
10 2019-01-07 07:45:01.190856 37 152.53 152.54 15 NaN NaN OPEN 152.54
11 2019-01-07 07:45:01.190856 37 152.53 152.54 16 ask 1.0 OPEN 152.54
12 2019-01-07 07:45:01.193938 37 152.53 152.55 108 NaN NaN OPEN 152.54
13 2019-01-07 07:45:01.193938 37 152.53 152.54 15 ask 15.0 OPEN 152.54
14 2019-01-07 07:45:01.194326 2 152.54 152.55 108 NaN NaN OPEN 152.54
15 2019-01-07 07:45:01.194453 2 152.54 152.55 97 NaN NaN OPEN 152.54
16 2019-01-07 07:45:01.194479 6 152.54 152.55 97 NaN NaN OPEN 152.54
17 2019-01-07 07:45:01.194507 19 152.54 152.55 97 NaN NaN OPEN 152.54
18 2019-01-07 07:45:01.194532 19 152.54 152.55 77 NaN NaN OPEN 152.54
19 2019-01-07 07:45:01.194598 19 152.54 152.55 79 NaN NaN OPEN 152.54
Here is one possible approach.
First, append a VMAP column full of NaNs:
df['VMAP'] = np.nan
Then calculate the VMAP (based on the equation provided by the OP) and assign values based on ask or bid, as required by the OP:
for trade in ['ask', 'bid']:
    # Find the indexes of `ask` or `bid` trades
    bid_idx = df[df.trade == trade].index
    # Slice the DF based on `ask` or `bid`, using those indexes
    df.loc[bid_idx, 'VMAP'] = (
        (df.loc[bid_idx, 'trade_size'] * df.loc[bid_idx, trade]).cumsum()
        /
        (df.loc[bid_idx, 'trade_size']).cumsum()
    )
print(df.iloc[:,1:])
time bid_size bid ask ask_size trade trade_size phase VMAP
0 07:45:01.064515 495 152.52 152.54 19 NaN NaN OPEN NaN
1 07:45:01.110072 31 152.53 152.54 19 NaN NaN OPEN NaN
2 07:45:01.116596 32 152.53 152.54 19 NaN NaN OPEN NaN
3 07:45:01.116860 32 152.53 152.54 21 NaN NaN OPEN NaN
4 07:45:01.116905 34 152.53 152.54 21 NaN NaN OPEN NaN
5 07:45:01.116982 34 152.53 152.54 31 NaN NaN OPEN NaN
6 07:45:01.147901 38 152.53 152.54 31 NaN NaN OPEN NaN
7 07:45:01.189971 38 152.53 152.54 31 ask 15.0 OPEN 152.54
8 07:45:01.189971 38 152.53 152.54 16 NaN NaN OPEN NaN
9 07:45:01.190766 37 152.53 152.54 16 NaN NaN OPEN NaN
10 07:45:01.190856 37 152.53 152.54 15 NaN NaN OPEN NaN
11 07:45:01.190856 37 152.53 152.54 16 ask 1.0 OPEN 152.54
12 07:45:01.193938 37 152.53 152.55 108 NaN NaN OPEN NaN
13 07:45:01.193938 37 152.53 152.54 15 ask 15.0 OPEN 152.54
14 07:45:01.194326 2 152.54 152.55 108 NaN NaN OPEN NaN
15 07:45:01.194453 2 152.54 152.55 97 NaN NaN OPEN NaN
16 07:45:01.194479 6 152.54 152.55 97 NaN NaN OPEN NaN
17 07:45:01.194507 19 152.54 152.55 97 NaN NaN OPEN NaN
18 07:45:01.194532 19 152.54 152.55 77 NaN NaN OPEN NaN
19 07:45:01.194598 19 152.54 152.55 79 NaN NaN OPEN NaN
EDIT: As #edinho correctly indicated, the VMAP is the same as the trade_price column.
Ok, here it is
df['trade_price'] = df.apply(lambda x: x['bid'] if x['trade']=='bid' else x['ask'], axis=1)
df['vwap'] = (df['trade_price'] * df['trade_size']).cumsum() / df['trade_size'].fillna(0).cumsum()
The first line:
It saves the trade_price in a new column, so it is easier to retrieve later.
If you want, you can delete this line and write a function instead (it may be easier to read), but I prefer to see the intermediate results.
Q: Why does it have values even when there is no trade?
A: Because of the way the lambda is written: the else captures the ask price. It won't make a difference, though, because of the next step.
The second line:
Here the real calculation takes place.
The first part calculates the total traded value (price times size) up to that moment (as you said, using cumulative sums makes life easier).
The second part calculates the total volume traded up to that moment (again, a cumulative sum).
If you want, you can break this line up and make more intermediate columns.
Q: Why the fillna(0)?
A: So the total volume doesn't get NaNs and you don't get a division error.
Q: Why so many NaNs in the vwap column?
A: Because of the rows with no trade. You could fill them with 0s, but it is better to keep the 'no trade' information.
P.S.: You may get a wrong result, as this treats volume and price the same regardless of side; you could flip a sign to get the convention you expect (for instance, making the ask price negative).
This code outputs:
trade_price vwap
1 152.54 NaN
2 152.54 NaN
3 152.54 NaN
4 152.54 NaN
5 152.54 NaN
6 152.54 NaN
7 152.54 NaN
8 152.54 152.54
9 152.54 NaN
10 152.54 NaN
11 152.54 NaN
12 152.54 152.54
13 152.55 NaN
14 152.54 152.54
15 152.55 NaN
16 152.55 NaN
17 152.55 NaN
18 152.55 NaN
19 152.55 NaN
20 152.55 NaN
