pd.to_datetime to solve '2010/1/1' rather than '2010/01/01' - python-3.x

I have a dataframe which contain a column 'trade_dt' like this
2009/12/1
2009/12/2
2009/12/3
2009/12/4
I got this problem
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y-&m-%d')
ValueError: time data '2009/12/1' does not match format '%Y-&m-%d' (match)
how to solve it? Thanks~

Need change format for match - replace & and - to % and /:
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y/%m/%d')
Also working with sample data removing format (but not sure with real data):
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'])
print (benchmark)
trade_dt
0 2009-12-01
1 2009-12-02
2 2009-12-03
3 2009-12-04

Related

How to handle errors with TimeDelta and Integers in Python

I need to calculate the distance between two dates.
df3['dist_2_1'] = (df3['Date2'] - df3['Date1'])
When I save this into my SQLite DB the format is terrible, so I decided to use an integer format which is much better.
df3['dist_2_1'] = (df3['Date2'] - df3['Date1']).astype('timedelta64[D]').astype(int)
So far so good, but in a similar case, I've NULL values which cause an error when I try to do the diference between dates.
df3['dist_B_3'] = df3['Break_date'] - df3['Date3']
The Break_date can be null, so I want that in this case the final result in dist_B_3 is 0, but now is an error that breaks everything. I tested this so far, but doesn't work...
try:
if df3['Break_date'] == 'NaT':
df3['dist_B_3'] = 0
else:
df3['dist_B_3'] = df3['Break_date'] - df3['Date3']
#().astype('timedelta64[D]').astype(int)
except Exception:
print("error in the dist_B_3")
My df3['Break_date'] df is this one, so the NaT are the ones creating the error.
0 2022-07-13
1 2022-07-12
2 2022-07-14
3 2022-07-14
4 NaT
5 NaT
Any idea on how to handle this?

Failing to use sumproduct on date ranges with multiple conditions [Python]

From replacement data table (below on the image), I am trying to incorporate the solbox product replace in time series data format(above on the image). I need to extract out the number of consumers per day from the information.
What I need to find out:
On a specific date, which number of solbox product was active
On a specific date, which number of solbox product (which was a consumer) was active
I have used this line of code in excel but cannot implement this on python properly.
=SUMPRODUCT((Record_Solbox_Replacement!$O$2:$O$1367 = "consumer") * (A475>=Record_Solbox_Replacement!$L$2:$L$1367)*(A475<Record_Solbox_Replacement!$M$2:$M$1367))
I tried in python -
timebase_df['date'] = pd.date_range(start = replace_table_df['solbox_started'].min(), end = replace_table_df['solbox_started'].max(), freq = frequency)
timebase_df['date_unix'] = timebase_df['date'].astype(np.int64) // 10**9
timebase_df['no_of_solboxes'] = ((timebase_df['date_unix']>=replace_table_df['started'].to_numpy()) & (timebase_df['date_unix'] < replace_table_df['ended'].to_numpy() & replace_table_df['customer_type'] == 'customer']))
ERROR:
~\Anaconda3\Anaconda4\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
232 # The ambiguous case is object-dtype. See GH#27803
233 if len(lvalues) != len(rvalues):
--> 234 raise ValueError("Lengths must match to compare")
235
236 if should_extension_dispatch(lvalues, rvalues):
ValueError: Lengths must match to compare
Can someone help me please? I can explain in comment section if I have missed something.

Replace values in observations (i.e., multiple columns within multiple rows) based on multiple conditionals

I am trying to replace the values of 3 columns within multiple observations based on two conditionals ( e.g., specific ID after a particular date).
I have seen similar questions.
Pandas Multiple Conditions Function based on Column
Pandas replace, multi column criteria
Pandas: How do I assign values based on multiple conditions for existing columns?
Replacing values in a pandas dataframe based on multiple conditions
However, they did not quite address my problem or I can't quite manipulate them to solve my problem.
This code will generate a dataframe similar to mine:
df = pd.DataFrame({'SUR_ID': {0:'SUR1', 1:'SUR1', 2:'SUR1', 3:'SUR1', 4:'SUR2', 5:'SUR2'}, 'DATE': {0:'05-01-2019', 1:'05-11-2019', 2:'06-15-2019', 3:'06-20-2019', 4: '05-15-2019', 5:'06-20-2019'}, 'ACTIVE_DATE': {0:'05-01-2019', 1:'05-01-2019', 2:'05-01-2019', 3:'05-01-2019', 4: '05-01-2019', 5:'05-01-2019'}, 'UTM_X': {0:'444895', 1:'444895', 2:'444895', 3:'444895', 4: '445050', 5:'445050'}, 'UTM_Y': {0:'4077528', 1:'4077528', 2:'4077528', 3:'4077528', 4: '4077762', 5:'4077762'}})
Output Dataframe:
What I am trying to do:
I am trying to replace UTM_X,UTM_Y, AND ACTIVE_DATE with
[444917, 4077830, '06-04-2019']
when
SUR_ID is "SUR1" and DATE >= "2019-06-04 12:00:00"
This is a poorly adapted version of the solution for question 1 in attempts to fix my problem- throws error:
df.loc[[df['SUR_ID'] == 'SUR1' and df['DATE'] >='2019-06-04 12:00:00'], ['UTM_X', 'UTM_Y', 'Active_Date']] = [444917, 4077830, '06-04-2019']
First ensure that the column Date is of type datetime, and then when using 2 conditions, they need to be between parenthesis individually. so you can do:
df.DATE = pd.to_datetime(df.DATE)
df.loc[ (df['SUR_ID'] == 'SUR1') & (df['DATE'] >= pd.to_datetime('2019-06-04 12:00:00')),
['UTM_X', 'UTM_Y', 'ACTIVE_DATE']] = [444917, 4077830, '06-04-2019']
See the difference between what you wrote for the boolean mask:
[df['SUR_ID'] == 'SUR1' and df['DATE'] >='2019-06-04 12:00:00']
and what is here with parenthesis
(df['SUR_ID'] == 'SUR1') & (df['DATE'] >= pd.to_datetime('2019-06-04 12:00:00'))
Use:
df['UTM_X']=df['UTM_X'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),444917)
df['UTM_Y']=df['UTM_Y'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),4077830)
df['ACTIVE_DATE']=df['ACTIVE_DATE'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),'06-04-2019')
Output:
SUR_ID DATE ACTIVE_DATE UTM_X UTM_Y
0 SUR1 05-01-2019 05-01-2019 444895 4077528
1 SUR1 05-11-2019 05-01-2019 444895 4077528
2 SUR1 06-15-2019 06-04-2019 444917 4077830
3 SUR1 06-20-2019 06-04-2019 444917 4077830
4 SUR2 05-15-2019 05-01-2019 445050 4077762
5 SUR2 06-20-2019 05-01-2019 445050 4077762

parsing dates from strings

I have a list of strings in python like this
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00 )from these strings.
So far I used a for loop like the one below but it's very time consuming, is there a better way to do this?
for files in glob.glob("AM_B0_*.flac.h5"):
if files[11]=='_':
year=files[12:16]
month=files[17:19]
day= files[20:22]
hour=files[23:25]
minute=files[25:27]
second=files[27:29]
tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year),int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
else:
year=files[11:15]
month=files[16:18]
day= files[19:21]
hour=files[22:24]
minute=files[24:26]
second=files[26:28]
tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this (based on the 2nd last '-', no need of if-else case):
filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
def find_second_last(text, pattern):
return text.rfind(pattern, 0, text.rfind(pattern))
for files in filesall:
start = find_second_last(files,'-') - 4 # from yyyy- part
timepart = (files[start:start+17]).replace("T"," ")
#insert 2 ':'s
timepart = timepart[:13] + ':' + timepart[13:15] + ':' +timepart[15:]
# print(timepart)
tindex=pd.date_range(start= timepart, periods=60, freq='10S')
In Place of using file[11] as hard coded go for last or 2nd last index of _ then use your code then you don't have to write 2 times same code. Or use regex to parse the string.

Converting time format

There must be a quick solution for this, but after 30min I gave up and need help.
This is the format of source data
0h56m40s 0h57m10s 1h00m40s 1h02m15s 1h02m25s
52m47s 54m25s 54m52s 57m23s 57m43s
49m30s 54m31s 54m34s 56m35s 56m36s
47m45s 48m03s 51m02s 52m23s 53m05s
46m54s 49m29s 50m51s 51m02s 51m03s
46m09s 47m56s 50m16s 51m20s 51m53s
46m55s 47m08s 47m13s 48m16s 50m11s
and I need this in time format like 0h56m40s to 0:56:40
I tried search/replace, from h to :, m to : and removing s, works for when there's hour, but messes up for when only minutes are there.
Any tips?
You can concatenate 0: if the input string is too short:
=(IF(LEN(A1)<7,"0:","") & SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,"h",":"),"m",":"),"s",""))+0
The +0 part converts string to time value (change cell format to h:mm:ss). If you prefer to keep it in text format, remove +0.

Resources