Issue with datetime formatting - python-3.x

I am having an issue with the datetime format of a set of data. The issue is that the hour of day ranges from 1 to 24, with the 24th hour assigned to the wrong day (more specifically, the previous day). A sample of the data is below:
1/1/2019,14:00,0.2,0.1,0.0,0.2,3.0,36.7,3,153
1/1/2019,15:00,0.2,0.6,0.2,0.4,3.9,36.7,1,199
1/1/2019,16:00,1.8,2.4,0.8,1.6,1.1,33.0,0,307
1/1/2019,17:00,3.0,3.2,0.6,2.6,6.0,32.8,1,310
1/1/2019,18:00,1.6,2.2,0.5,1.7,7.9,33.1,4,293
1/1/2019,19:00,1.7,1.1,0.6,0.6,5.9,35.0,5,262
1/1/2019,20:00,1.0,0.5,0.2,0.2,2.9,32.6,5,201
1/1/2019,21:00,0.6,0.3,0.0,0.4,2.1,31.8,6,182
1/1/2019,22:00,0.4,0.3,0.0,0.4,5.1,31.4,6,187
1/1/2019,23:00,0.8,0.6,0.3,0.3,9.9,30.2,5,227
1/1/2019,24:00,1.0,0.7,0.3,0.4,6.9,27.9,4,225 --- Here the date should be 1/2/2019
1/2/2019,01:00,1.3,0.9,0.5,0.4,4.0,26.9,6,236
1/2/2019,02:00,0.4,0.4,0.2,0.2,5.0,27.3,6,168
1/2/2019,03:00,0.7,0.5,0.3,0.3,6.9,30.2,4,219
1/2/2019,04:00,1.3,0.8,0.5,0.3,5.9,32.3,4,242
1/2/2019,05:00,0.7,0.2,0.0,0.2,3.0,33.8,4,177
1/2/2019,06:00,0.5,0.2,0.2,0.1,5.1,36.1,4,195
1/2/2019,07:00,0.6,0.3,0.2,0.2,9.9,38.0,4,200
1/2/2019,08:00,0.5,0.6,0.4,0.3,6.8,38.9,4,179
1/2/2019,09:00,0.5,0.2,0.0,0.2,3.0,39.0,4,193
1/2/2019,10:00,0.3,0.3,0.2,0.1,4.0,38.7,5,198
1/2/2019,11:00,0.3,0.3,0.2,0.0,4.9,38.4,5,170
1/2/2019,12:00,0.6,0.3,0.3,0.0,2.0,38.4,4,172
1/2/2019,13:00,0.2,0.3,0.2,0.0,2.0,38.8,4,154
1/2/2019,14:00,0.3,0.1,0.0,0.2,1.9,39.3,4,145
This is a fairly large data set that I need to make a time-series plot of, so I need a way to fix this formatting issue. I was attempting to iterate through the rows of a pandas DataFrame to fix the problematic rows, but this did not produce any results. Thank you in advance for any help.

You can convert the date column to datetimes with to_datetime and then add the time column converted to timedeltas with to_timedelta:
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['time'] + ':00')
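This works precisely because of the 24:00 rows: pandas parses '24:00:00' as a timedelta of one full day, so those rows roll over to the next date. A quick check:
import pandas as pd

print (pd.to_timedelta('24:00:00'))
# 1 days 00:00:00
print (pd.to_datetime('1/1/2019') + pd.to_timedelta('24:00:00'))
# 2019-01-02 00:00:00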
Or, if you also need to remove the time column, use DataFrame.pop. A full demonstration, starting from the sample data:
print (df)
date time a b c d e f g h
0 1/1/2019 14:00 0.2 0.1 0.0 0.2 3.0 36.7 3 153
1 1/1/2019 15:00 0.2 0.6 0.2 0.4 3.9 36.7 1 199
2 1/1/2019 16:00 1.8 2.4 0.8 1.6 1.1 33.0 0 307
3 1/1/2019 17:00 3.0 3.2 0.6 2.6 6.0 32.8 1 310
4 1/1/2019 18:00 1.6 2.2 0.5 1.7 7.9 33.1 4 293
5 1/1/2019 19:00 1.7 1.1 0.6 0.6 5.9 35.0 5 262
6 1/1/2019 20:00 1.0 0.5 0.2 0.2 2.9 32.6 5 201
7 1/1/2019 21:00 0.6 0.3 0.0 0.4 2.1 31.8 6 182
8 1/1/2019 22:00 0.4 0.3 0.0 0.4 5.1 31.4 6 187
9 1/1/2019 23:00 0.8 0.6 0.3 0.3 9.9 30.2 5 227
10 1/1/2019 24:00 1.0 0.7 0.3 0.4 6.9 27.9 4 225
11 1/2/2019 01:00 1.3 0.9 0.5 0.4 4.0 26.9 6 236
12 1/2/2019 02:00 0.4 0.4 0.2 0.2 5.0 27.3 6 168
13 1/2/2019 03:00 0.7 0.5 0.3 0.3 6.9 30.2 4 219
14 1/2/2019 04:00 1.3 0.8 0.5 0.3 5.9 32.3 4 242
15 1/2/2019 05:00 0.7 0.2 0.0 0.2 3.0 33.8 4 177
16 1/2/2019 06:00 0.5 0.2 0.2 0.1 5.1 36.1 4 195
17 1/2/2019 07:00 0.6 0.3 0.2 0.2 9.9 38.0 4 200
18 1/2/2019 08:00 0.5 0.6 0.4 0.3 6.8 38.9 4 179
19 1/2/2019 09:00 0.5 0.2 0.0 0.2 3.0 39.0 4 193
20 1/2/2019 10:00 0.3 0.3 0.2 0.1 4.0 38.7 5 198
21 1/2/2019 11:00 0.3 0.3 0.2 0.0 4.9 38.4 5 170
22 1/2/2019 12:00 0.6 0.3 0.3 0.0 2.0 38.4 4 172
23 1/2/2019 13:00 0.2 0.3 0.2 0.0 2.0 38.8 4 154
24 1/2/2019 14:00 0.3 0.1 0.0 0.2 1.9 39.3 4 145
df['date'] = pd.to_datetime(df['date']) + pd.to_timedelta(df.pop('time') + ':00')
print (df)
date a b c d e f g h
0 2019-01-01 14:00:00 0.2 0.1 0.0 0.2 3.0 36.7 3 153
1 2019-01-01 15:00:00 0.2 0.6 0.2 0.4 3.9 36.7 1 199
2 2019-01-01 16:00:00 1.8 2.4 0.8 1.6 1.1 33.0 0 307
3 2019-01-01 17:00:00 3.0 3.2 0.6 2.6 6.0 32.8 1 310
4 2019-01-01 18:00:00 1.6 2.2 0.5 1.7 7.9 33.1 4 293
5 2019-01-01 19:00:00 1.7 1.1 0.6 0.6 5.9 35.0 5 262
6 2019-01-01 20:00:00 1.0 0.5 0.2 0.2 2.9 32.6 5 201
7 2019-01-01 21:00:00 0.6 0.3 0.0 0.4 2.1 31.8 6 182
8 2019-01-01 22:00:00 0.4 0.3 0.0 0.4 5.1 31.4 6 187
9 2019-01-01 23:00:00 0.8 0.6 0.3 0.3 9.9 30.2 5 227
10 2019-01-02 00:00:00 1.0 0.7 0.3 0.4 6.9 27.9 4 225
11 2019-01-02 01:00:00 1.3 0.9 0.5 0.4 4.0 26.9 6 236
12 2019-01-02 02:00:00 0.4 0.4 0.2 0.2 5.0 27.3 6 168
13 2019-01-02 03:00:00 0.7 0.5 0.3 0.3 6.9 30.2 4 219
14 2019-01-02 04:00:00 1.3 0.8 0.5 0.3 5.9 32.3 4 242
15 2019-01-02 05:00:00 0.7 0.2 0.0 0.2 3.0 33.8 4 177
16 2019-01-02 06:00:00 0.5 0.2 0.2 0.1 5.1 36.1 4 195
17 2019-01-02 07:00:00 0.6 0.3 0.2 0.2 9.9 38.0 4 200
18 2019-01-02 08:00:00 0.5 0.6 0.4 0.3 6.8 38.9 4 179
19 2019-01-02 09:00:00 0.5 0.2 0.0 0.2 3.0 39.0 4 193
20 2019-01-02 10:00:00 0.3 0.3 0.2 0.1 4.0 38.7 5 198
21 2019-01-02 11:00:00 0.3 0.3 0.2 0.0 4.9 38.4 5 170
22 2019-01-02 12:00:00 0.6 0.3 0.3 0.0 2.0 38.4 4 172
23 2019-01-02 13:00:00 0.2 0.3 0.2 0.0 2.0 38.8 4 154
24 2019-01-02 14:00:00 0.3 0.1 0.0 0.2 1.9 39.3 4 145
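With the date column repaired, the time-series plot is straightforward. A minimal sketch, assuming matplotlib is available and column a is the series of interest:
import matplotlib.pyplot as plt

# plot one value column against the fixed datetime index
df.set_index('date')['a'].plot()
plt.show()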

Related

Comparing one month's value of the current year with previous year values, adding or subtracting multiple parameters

Given the following dataframe df:
date mom_pct
0 2020-1-31 1.0
1 2020-2-29 0.8
2 2020-3-31 -1.2
3 2020-4-30 -0.9
4 2020-5-31 -0.8
5 2020-6-30 -0.1
6 2020-7-31 0.6
7 2020-8-31 0.4
8 2020-9-30 0.2
9 2020-10-31 -0.3
10 2020-11-30 -0.6
11 2020-12-31 0.7
12 2021-1-31 1.0
13 2021-2-28 0.6
14 2021-3-31 -0.5
15 2021-4-30 -0.3
16 2021-5-31 -0.2
17 2021-6-30 -0.4
18 2021-7-31 0.3
19 2021-8-31 0.1
20 2021-9-30 0.0
21 2021-10-31 0.7
22 2021-11-30 0.4
23 2021-12-31 -0.3
24 2022-1-31 0.4
25 2022-2-28 0.6
26 2022-3-31 0.0
27 2022-4-30 0.4
28 2022-5-31 -0.2
I want to compare the chain ratio (month-over-month) value of a month in the current year to the value of the same month in the previous year. Assume that the value for the same period last year is y_t-1, and the current value of this year is y_t. I will create a new column according to the following rules:
If y_t = y_t-1, return 0 for the new column;
If y_t ∈ (y_t-1, y_t-1 + 0.3], return 1;
If y_t ∈ (y_t-1 + 0.3, y_t-1 + 0.5], return 2;
If y_t > (y_t-1 + 0.5), return 3;
If y_t ∈ [y_t-1 - 0.3, y_t-1), return -1;
If y_t ∈ [y_t-1 - 0.5, y_t-1 - 0.3), return -2;
If y_t < (y_t-1 - 0.5), return -3.
The expected result:
date mom_pct categorial_mom_pct
0 2020-1-31 1.0 NaN
1 2020-2-29 0.8 NaN
2 2020-3-31 -1.2 NaN
3 2020-4-30 -0.9 NaN
4 2020-5-31 -0.8 NaN
5 2020-6-30 -0.1 NaN
6 2020-7-31 0.6 NaN
7 2020-8-31 0.4 NaN
8 2020-9-30 0.2 NaN
9 2020-10-31 -0.3 NaN
10 2020-11-30 -0.6 NaN
11 2020-12-31 0.7 NaN
12 2021-1-31 1.0 0.0
13 2021-2-28 0.6 -1.0
14 2021-3-31 -0.5 3.0
15 2021-4-30 -0.3 3.0
16 2021-5-31 -0.2 3.0
17 2021-6-30 -0.4 -1.0
18 2021-7-31 0.3 -1.0
19 2021-8-31 0.1 -1.0
20 2021-9-30 0.0 -1.0
21 2021-10-31 0.7 3.0
22 2021-11-30 0.4 3.0
23 2021-12-31 -0.3 -3.0
24 2022-1-31 0.4 -3.0
25 2022-2-28 0.6 0.0
26 2022-3-31 0.0 2.0
27 2022-4-30 0.4 3.0
28 2022-5-31 -0.2 0.0
I attempted to create multiple columns holding the shifted thresholds, then check which range mom_pct falls into. Is it possible to do this in a more efficient way? Thanks.
df1['mom_pct_zero'] = df1['mom_pct'].shift(12)
df1['mom_pct_pos1'] = df1['mom_pct'].shift(12) + 0.3
df1['mom_pct_pos2'] = df1['mom_pct'].shift(12) + 0.5
df1['mom_pct_neg1'] = df1['mom_pct'].shift(12) - 0.3
df1['mom_pct_neg2'] = df1['mom_pct'].shift(12) - 0.5
I would do it as follows:
import numpy as np

def categorize(v):
    # keep NaN (the first 12 months) and exact zero differences as-is
    if np.isnan(v) or v == 0.:
        return v
    sign = -1 if v < 0 else 1
    eps = 1e-10  # tolerance for floating-point comparisons
    if abs(v) <= 0.3 + eps:
        return sign * 1
    if abs(v) <= 0.5 + eps:
        return sign * 2
    return sign * 3

df['categorial_mom_pct'] = df['mom_pct'].diff(12).map(categorize)
print(df)
Note that I added a very small eps to the thresholds to counter precision issues with floating-point arithmetic:
abs(-0.3) <= 0.3 # True
abs(-0.4 + 0.1) <= 0.3 # False
abs(-0.4 + 0.1) <= 0.3 + 1e-10 # True
Out:
date mom_pct categorial_mom_pct
0 2020-1-31 1.0 NaN
1 2020-2-29 0.8 NaN
2 2020-3-31 -1.2 NaN
3 2020-4-30 -0.9 NaN
4 2020-5-31 -0.8 NaN
5 2020-6-30 -0.1 NaN
6 2020-7-31 0.6 NaN
7 2020-8-31 0.4 NaN
8 2020-9-30 0.2 NaN
9 2020-10-31 -0.3 NaN
10 2020-11-30 -0.6 NaN
11 2020-12-31 0.7 NaN
12 2021-1-31 1.0 0.0
13 2021-2-28 0.6 -1.0
14 2021-3-31 -0.5 3.0
15 2021-4-30 -0.3 3.0
16 2021-5-31 -0.2 3.0
17 2021-6-30 -0.4 -1.0
18 2021-7-31 0.3 -1.0
19 2021-8-31 0.1 -1.0
20 2021-9-30 0.0 -1.0
21 2021-10-31 0.7 3.0
22 2021-11-30 0.4 3.0
23 2021-12-31 -0.3 -3.0
24 2022-1-31 0.4 -3.0
25 2022-2-28 0.6 0.0
26 2022-3-31 0.0 2.0
27 2022-4-30 0.4 3.0
28 2022-5-31 -0.2 0.0
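As an aside (not part of the original answer), numpy's tolerance-aware comparison can replace the hand-rolled eps:
import numpy as np

v = -0.4 + 0.1
# np.isclose treats values within floating-point error of the
# threshold as equal to it, so no explicit eps is needed
print(abs(v) < 0.3 or np.isclose(abs(v), 0.3))  # True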

ValueError: setting an array element with a sequence when creating new columns in pandas

I am trying to create a new column Trend by writing this code:
df_cal['Trend'] = np.where((df_cal['75% Quantile'] > df_cal['Shift 75% Quantile']) & (df_cal['25% Quantile'] > df_cal['Shift 25% Quantile']), "Up",
np.where(df_cal['75% Quantile'] < df_cal['Shift 75% Quantile']) & (df_cal['25% Quantile'] < df_cal['Shift 25% Quantile']), "Down","Flat")
However, when I run the code it gives me this error:
ValueError: setting an array element with a sequence.
Is there any way to solve this?
A sample of the table is shown below.
DateTime 75% Quantile 25% Quantile Shift 75% Quantile Shift 25% Quantile
0 2020-12-18 15:00 2.0 -4.0 NaN NaN
1 2020-12-18 16:00 4.0 -4.0 2.0 -4.0
2 2020-12-18 17:00 -4.0 -10.0 4.0 -4.0
3 2020-12-18 18:00 8.0 8.0 -4.0 -10.0
4 2020-12-18 19:00 0.0 -4.0 8.0 8.0
5 2020-12-18 20:00 0.0 0.0 0.0 -4.0
6 2020-12-19 08:00 8.0 8.0 0.0 0.0
7 2020-12-19 09:00 -2.0 -6.0 8.0 8.0
8 2020-12-19 10:00 4.0 -8.0 -2.0 -6.0
9 2020-12-19 11:00 0.0 -4.0 4.0 -8.0
10 2020-12-19 12:00 4.0 -4.0 0.0 -4.0
Explanation
The error comes from a misplaced closing parenthesis: the inner np.where(...) is closed right after its first condition, so the second condition and the "Down"/"Flat" arguments end up outside the call. Instead of fixing the nested np.where, I use Series.mask() to achieve the conversion you need.
Source Code
up_cond = ((df_cal['75% Quantile'] > df_cal['Shift 75% Quantile'])
           & (df_cal['25% Quantile'] > df_cal['Shift 25% Quantile']))
down_cond = ((df_cal['75% Quantile'] < df_cal['Shift 75% Quantile'])
             & (df_cal['25% Quantile'] < df_cal['Shift 25% Quantile']))
df_cal['Trend'] = 'Flat'
df_cal['Trend'] = df_cal['Trend'].mask(up_cond, "Up")
df_cal['Trend'] = df_cal['Trend'].mask(down_cond, "Down")
Result: the Trend column now holds 'Up', 'Down', or 'Flat' for every row (rows whose shifted quantiles are NaN fail both conditions and stay 'Flat').
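As a side note, once the conditions are balanced correctly, np.select handles the three outcomes without nesting; a sketch (an alternative, not part of the original answer):
import numpy as np

# first matching condition wins; everything else falls back to 'Flat'
df_cal['Trend'] = np.select([up_cond, down_cond], ['Up', 'Down'], default='Flat')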

Parse commented data from Python requests with BSoup

I am trying to parse some data from Basketball-Reference, but so far I'm unable to do so. Here is my code for getting the raw HTML data:
import requests
from bs4 import BeautifulSoup
url='https://www.basketball-reference.com/teams/DAL/2021/lineups/'
response=requests.get(url=url)
soup=BeautifulSoup(response.content,'html.parser')
soup.find(attrs={'id':"all_lineups_5-man_"}).find('table')
That last line gives an error, when it shouldn't. My guess is that it is happening because the table is wrapped in an HTML comment (<!-- ... -->) in the page source. So my question is, how should I approach this?
You could loop through the comments, grab the tables, and then use pandas.
For example:
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment
from tabulate import tabulate
url = 'https://www.basketball-reference.com/teams/DAL/2021/lineups/'
response = requests.get(url)
soup = BeautifulSoup(
response.content, 'html.parser'
).find_all(text=lambda text: isinstance(text, Comment))
tables = [c for c in soup if "<div" in c]
frames = [pd.read_html(table, flavor="bs4") for table in tables]
print(tabulate(pd.concat(frames[-1])))
pd.concat(frames[-1]).to_csv("table_4.csv", index=False)
Output:
-- --- ------------------------------- ------ ----- ---- ---- ------ ---- ---- ------ ------ ----- ---- ------ ---- ----- ---- ----- ---- ----- ---- ---- ---- ---- ----
0 1 L. Dončić | T. Hardaway 435:01 1.9 1.8 -0.5 0.023 0.3 2.1 -0.012 0.025 -2 -3 0.012 -2.1 -3.9 -0.5 -3.9 -1.5 -3.2 0.8 -0.4 1.2 -1.1 -0.9
1 2 W. Cauley-Stein | L. Dončić 288:50 8.3 5.2 0.4 0.057 0.7 -1.3 0.032 0.06 -2.9 -3.8 -0.003 -0.6 1.2 5.3 1.2 2.5 5.5 0.6 -1.6 3 0.5 -0.8
2 3 L. Dončić | J. Richardson 266:22 -1.5 0 4.2 -0.023 -0.5 6.1 -0.071 -0.029 -1.1 -1 -0.01 -0.7 -3 -3.7 -3 -2.2 -5 -1.4 2.7 0.8 -4.4 -1.1
3 4 D. Finney-Smith | J. Richardson 252:38 -3.3 -0.5 5.1 -0.034 -0.4 5.8 -0.06 -0.04 -1.8 -2 -0.012 0.2 -1 -3.4 -1 -1.5 -3.5 0.3 0.2 0 -3.9 -1.7
4 5 T. Burke | L. Dončić 251:51 -2.2 3.2 1.1 0.03 -1.4 1.6 -0.052 0.021 -7.2 -9.5 -0.008 -3.8 -7 -1.4 -7 -2.8 -6.1 -3.6 0.3 1 -2.5 1.6
5 6 J. Brunson | T. Hardaway 246:37 -4.4 -2.9 -3.9 -0.012 -3.2 -3.8 -0.05 -0.027 4.5 3.7 0.06 -3.8 -8.1 -2.5 -8.1 -3.2 -7.2 -0.6 -1.2 0 -1.2 -1.6
6 7 T. Burke | J. Johnson 242:07 -2.5 2.3 -0.6 0.029 0.2 0.2 0.004 0.03 -7.2 -8.5 -0.042 -3.6 -7.1 -2.1 -7.1 -2.8 -6.2 -2.1 0 1.9 0 2.8
7 8 L. Dončić | D. Finney-Smith 236:07 -3.5 1.6 6.7 -0.019 -0.2 5.3 -0.055 -0.026 -6.4 -6.4 -0.064 -0.4 -2 -2.7 -2 -1.6 -3.6 0.2 0.9 -0.4 -3.1 -0.3
8 9 T. Hardaway | J. Richardson 230:38 -8.2 -0.7 7.3 -0.048 -0.8 8.8 -0.103 -0.059 -6 -5.8 -0.069 0.2 -2.7 -7.2 -2.7 -3.6 -8.3 -1.1 1 0.7 -5.3 0.1
9 10 L. Dončić | J. Johnson 229:11 -7.7 2.1 -1 0.028 -1.9 -1.8 -0.035 0.018 -10 -9.7 -0.11 -4.7 -9.1 -3 -9.1 -3.7 -7.8 -2.8 -0.9 0.8 -0.3 2.1
10 11 D. Finney-Smith | T. Hardaway 225:53 -1 3.8 10.2 -0.011 0 4.5 -0.04 -0.02 -8.6 -8.8 -0.077 0.9 -0.6 -5.2 -0.6 -2.2 -5.1 2.3 0.7 -0.9 -5.1 -0.1
11 12 W. Cauley-Stein | T. Hardaway 215:03 12.3 6 -0.9 0.071 1.8 -4.1 0.085 0.082 -1.4 -3.8 0.056 -1.9 -1.2 4.4 -1.2 1.6 3.4 3.9 -1.3 2.7 -0.4 0.1
12 13 T. Hardaway | J. Johnson 214:53 -2.7 0.3 -4.9 0.029 -0.1 -6.5 0.06 0.033 -3.1 -3.6 -0.017 -4.9 -9.3 -2 -9.3 -3.5 -7.8 -2 -0.2 1.2 1 2.1
13 14 J. Brunson | J. Johnson 190:02 -11.5 -1.6 -3.6 0.001 -4.7 -8.1 -0.052 -0.024 -3.7 -2.3 -0.081 -4.7 -9.6 -3.4 -9.6 -4 -8.7 -2.3 -1.9 0.5 0.5 1.8
14 15 T. Hardaway | K. Porziņģis 188:30 -12.7 -6.6 -3.2 -0.055 -1.1 4.8 -0.074 -0.059 1.7 -2.9 0.192 -3.2 -7.3 -5 -7.3 -4.5 -9 -0.6 -3.2 0.1 0.8 -1.2
15 16 L. Dončić | K. Porziņģis 181:14 -8.8 -2.5 -2.1 -0.017 -2.6 3.3 -0.1 -0.03 -1.1 -6.7 0.19 -2.5 -4.9 -1.7 -4.9 -2.6 -5.4 0.2 -4.1 1.5 1.8 0
16 17 T. Hardaway | D. Powell 178:57 -3.2 -0.7 2.1 -0.019 2 3.1 0.025 -0.009 -3.9 -3 -0.071 -2.4 -6.7 -6.1 -6.7 -4.3 -9.7 -0.7 1.2 0.3 -1.5 -0.1
17 18 T. Burke | T. Hardaway 177:36 -2.2 -2.4 -3.7 -0.008 2 2.8 0.027 0.006 0.5 -0.9 0.047 -3.3 -6 -1.8 -6 -3 -6.2 -6.2 2 1.3 0.7 0.1
18 19 J. Brunson | L. Dončić 165:52 -3.6 -0.8 -7.2 0.029 -4.4 -5.7 -0.077 0.008 2.5 3.1 0 -6.2 -11.6 -0.7 -11.6 -3.5 -7.8 0.2 -1.4 1.8 1.1 -1.6
19 20 L. Dončić | D. Powell 162:30 -5 0.3 2.3 -0.009 1 3 -0.002 -0.005 -6.6 -4.3 -0.119 -4.3 -11.1 -7.3 -11.1 -5.8 -13.2 0.7 2.3 0.7 -4.3 0.2
20 nan Team Average 965:54 -2.1 0.4 0.6 0.002 -0.7 0.8 -0.025 -0.002 -2.2 -2.3 -0.021 -2.1 -4.7 -2.3 -4.7 -2.2 -4.9 -0.4 -0.6 0.5 -1.8 -0.3
-- --- ------------------------------- ------ ----- ---- ---- ------ ---- ---- ------ ------ ----- ---- ------ ---- ----- ---- ----- ---- ----- ---- ---- ---- ---- ----
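If you only need the 5-man lineup table rather than picking frames by position, one hedged refinement (assuming the id string from the question also appears inside the commented-out HTML) is to filter the comments first:
# keep only comments mentioning the 5-man lineups id from the question
five_man = [c for c in soup if "lineups_5-man_" in c]
if five_man:
    df_5man = pd.read_html(five_man[0], flavor="bs4")[0]
    print (df_5man.head())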

Getting total count of another column using a specific day in a month?

VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count PULocationID DOLocationID fare_amount
0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0 238 239 6.0
1 1.0 2020-01-01 00:35:39 2020-01-01 00:43:04 1.0 239 238 7.0
2 1.0 2020-01-01 00:47:41 2020-01-01 00:53:52 1.0 238 238 6.0
3 1.0 2020-01-01 00:55:23 2020-01-01 01:00:14 1.0 238 151 5.5
4 2.0 2020-01-01 00:01:58 2020-01-01 00:04:16 1.0 193 193 3.5
5 2.0 2020-01-01 00:09:44 2020-01-01 00:10:37 1.0 7 193 2.5
6 2.0 2020-01-01 00:39:25 2020-01-01 00:39:29 1.0 193 193 2.5
7 1.0 2020-01-01 00:29:01 2020-01-01 00:40:28 2.0 246 48 8.0
8 1.0 2020-01-01 00:55:11 2020-01-01 01:12:03 2.0 246 79 12.0
9 1.0 2020-01-01 00:37:15 2020-01-01 00:51:41 1.0 163 161 9.5
I have this data for January 2020 (it spans the whole month; this is just a snippet). I want to answer a question like 'Saturday is the busiest day in terms of passenger pickups.'
How do I go about it?
The columns labelled 'tpep_pickup_datetime' and 'tpep_dropoff_datetime' are of object dtype.
First, the data in column tpep_pickup_datetime was changed to different datetimes for a better sample:
print (df)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \
0 1.0 2020-01-01 00:28:15 2020-01-01 00:33:03 1.0
1 1.0 2020-01-02 00:35:39 2020-01-01 00:43:04 1.0
2 1.0 2020-01-02 00:47:41 2020-01-01 00:53:52 1.0
3 1.0 2020-01-03 00:55:23 2020-01-01 01:00:14 1.0
4 2.0 2020-01-03 00:01:58 2020-01-01 00:04:16 1.0
5 2.0 2020-01-03 00:09:44 2020-01-01 00:10:37 1.0
6 2.0 2020-01-04 00:39:25 2020-01-01 00:39:29 1.0
7 1.0 2020-01-04 00:29:01 2020-01-01 00:40:28 2.0
8 1.0 2020-01-04 00:55:11 2020-01-01 01:12:03 2.0
9 1.0 2020-01-05 00:37:15 2020-01-01 00:51:41 1.0
PULocationID DOLocationID fare_amount
0 238 239 6.0
1 239 238 7.0
2 238 238 6.0
3 238 151 5.5
4 193 193 3.5
5 7 193 2.5
6 193 193 2.5
7 246 48 8.0
8 246 79 12.0
9 163 161 9.5
Convert the column to datetimes, get the names of the days with Series.dt.day_name, and aggregate sum:
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['day'] = df['tpep_pickup_datetime'].dt.day_name()
s = df.groupby('day')['passenger_count'].sum()
print (s)
day
Friday 3.0
Saturday 5.0
Sunday 1.0
Thursday 2.0
Wednesday 1.0
Name: passenger_count, dtype: float64
Then for the index, here the name of the day with the maximal value, use Series.idxmax; for the maximal value use max:
print (s.idxmax())
Saturday
print (s.max())
5.0
And if you need both, it is possible to use Series.agg:
print (s.agg(['idxmax','max']))
idxmax Saturday
max 5
Name: passenger_count, dtype: object
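Note that this sums passenger_count per day. If 'busiest' should instead mean the number of pickup events, a hedged variant counts rows per day name:
# count pickups (rows) per weekday instead of summing passengers
counts = df['tpep_pickup_datetime'].dt.day_name().value_counts()
print (counts.idxmax())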

Update single column based on multiple 'priority' columns

Suppose you had a DataFrame with a number of columns / Series, say five for example. If the fifth column (named 'Updated Col') had values in addition to NaNs, what would be the best way to fill the NaNs in 'Updated Col' with values from the other columns, based on a preferred column order?
e.g. my DataFrame looks something like this:
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       Nan
12/03/2017 0:40  0.1                 Nan
12/03/2017 0:50  0.6            0.5  Nan
12/03/2017 1:00  0.4       0.3       Nan
12/03/2017 1:10  0.3            0.2  Nan
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
...and say, for example, I wanted the values from column 3 as a priority, followed by 2, then 1. I would expect the DataFrame to look like this:
Date             1    2    3    4    Updated Col
12/03/2017 0:00  0.4                 0.9
12/03/2017 0:10  0.4                 0.1
12/03/2017 0:20  0.4                 0.6
12/03/2017 0:30  0.9       0.7       0.7
12/03/2017 0:40  0.1                 0.1
12/03/2017 0:50  0.6            0.5  0.5
12/03/2017 1:00  0.4       0.3       0.3
12/03/2017 1:10  0.3            0.2  0.2
12/03/2017 1:20  0.9                 0.8
12/03/2017 1:30  0.9                 0.8
12/03/2017 1:40  0.0                 0.9
...values would be taken from the lower-priority columns only if the higher-priority columns were empty / NaN.
What would be the best way to do this?
I've tried numerous np.where attempts but can't work out what the best way would be.
Many thanks in advance.
You can use forward filling along the columns with ffill(axis=1) and then select the column:
updated_col = 'Updated Col'
# define columns to check; maybe [1,2,3,4] if integer column names
cols = ['1','2','3','4'] + [updated_col]
print (df[cols].ffill(axis=1))
1 2 3 4 Updated Col
0 0.4 0.4 0.4 0.4 0.9
1 0.4 0.4 0.4 0.4 0.1
2 0.4 0.4 0.4 0.4 0.6
3 0.9 0.9 0.7 0.7 0.7
4 0.1 0.1 0.1 0.1 0.1
5 0.6 0.6 0.6 0.5 0.5
6 0.4 0.4 0.3 0.3 0.3
7 0.3 0.3 0.3 0.2 0.2
8 0.9 0.9 0.9 0.9 0.8
9 0.9 0.9 0.9 0.9 0.8
10 0.0 0.0 0.0 0.0 0.9
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
Date 1 2 3 4 Updated Col
0 12/03/2017 0:00 0.4 NaN NaN NaN 0.9
1 12/03/2017 0:10 0.4 NaN NaN NaN 0.1
2 12/03/2017 0:20 0.4 NaN NaN NaN 0.6
3 12/03/2017 0:30 0.9 NaN 0.7 NaN 0.7
4 12/03/2017 0:40 0.1 NaN NaN NaN 0.1
5 12/03/2017 0:50 0.6 NaN NaN 0.5 0.5
6 12/03/2017 1:00 0.4 NaN 0.3 NaN 0.3
7 12/03/2017 1:10 0.3 NaN NaN 0.2 0.2
8 12/03/2017 1:20 0.9 NaN NaN NaN 0.8
9 12/03/2017 1:30 0.9 NaN NaN NaN 0.8
10 12/03/2017 1:40 0.0 NaN NaN NaN 0.9
EDIT:
Thank you shivsn for the comments.
If the DataFrame contains 'Nan' strings (which are not NaN missing values) or empty strings, it is necessary to replace them first:
updated_col = 'Updated Col'
cols = ['1','2','3','4'] + ['Updated Col']
d = {'Nan':np.nan, '': np.nan}
df = df.replace(d)
df[updated_col] = df[cols].ffill(axis=1)[updated_col]
print (df)
Date 1 2 3 4 Updated Col
0 12/03/2017 0:00 0.4 NaN NaN NaN 0.9
1 12/03/2017 0:10 0.4 NaN NaN NaN 0.1
2 12/03/2017 0:20 0.4 NaN NaN NaN 0.6
3 12/03/2017 0:30 0.9 NaN 0.7 NaN 0.7
4 12/03/2017 0:40 0.1 NaN NaN NaN 0.1
5 12/03/2017 0:50 0.6 NaN NaN 0.5 0.5
6 12/03/2017 1:00 0.4 NaN 0.3 NaN 0.3
7 12/03/2017 1:10 0.3 NaN NaN 0.2 0.2
8 12/03/2017 1:20 0.9 NaN NaN NaN 0.8
9 12/03/2017 1:30 0.9 NaN NaN NaN 0.8
10 12/03/2017 1:40 0.0 NaN NaN NaN 0.9
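One caveat: ffill(axis=1) effectively gives the rightmost non-null column priority (4 over 3 over 2 over 1 here). If you need the explicit order from the question (3, then 2, then 1), a sketch using fillna in priority order, assuming the string column names above:
# fill remaining NaNs in 'Updated Col' from each column in turn
out = df[updated_col].copy()
for col in ['3', '2', '1']:
    out = out.fillna(df[col])
df[updated_col] = out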
