bumping to good business day when generating range - python-3.x
I am using pandas pd.bdate_range() to generate a range of dates given a start and end, but it seems to not work as expected.
What I am ultimately after is quarterly dates over a start and end date, but I want the dates to be valid business days.
start = '2015-06-01'
end = '2019-06-01'
dates = pd.bdate_range(start,end,freq='MS')[::3]
Unfortunately, this includes 2018-09-01, which is a Saturday.
Is there a more foolproof way to get an index of only business days, also taking USFederalHolidayCalendar() into account?
You can take your existing DatetimeIndex and roll each date forward to the next business day like so:
import pandas as pd
from pandas.tseries.offsets import BDay
start = '2015-06-01'
end = '2019-06-01'
dates = pd.bdate_range(start,end,freq='MS')[::3]
new_dates = dates.map(lambda x : x + 0*BDay())
Or you can pass BMS as the freq keyword argument like so:
start = '2015-06-01'
end = '2019-06-01'
dates = pd.bdate_range(start,end, freq='BMS')[::3]
Both give this output
DatetimeIndex(['2015-06-01', '2015-09-01', '2015-12-01', '2016-03-01',
'2016-06-01', '2016-09-01', '2016-12-01', '2017-03-01',
'2017-06-01', '2017-09-01', '2017-12-01', '2018-03-01',
'2018-06-01', '2018-09-03', '2018-12-03', '2019-03-01',
'2019-06-03'],
dtype='datetime64[ns]', freq=None)
I think you can pass the following to get what you desire.
freq='BMS' # Business month start
or
freq='BQS' # Business quarter start
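Since the quarters in the question start in March, June, September and December, an anchored quarterly alias may avoid the [::3] slicing altogether. A minimal sketch, assuming the anchored alias 'BQS-MAR' behaves the same in your pandas version; note that it only skips weekends, not holidays:

```python
import pandas as pd

start = '2015-06-01'
end = '2019-06-01'

# 'BQS-MAR' anchors the business quarter starts on Mar, Jun, Sep and Dec
dates = pd.date_range(start, end, freq='BQS-MAR')
print(dates)
```

The last quarter start (2019-06-03) falls after end, so it is not part of this range; widen end if you need it.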
Update:
You could do something like this to take care of holidays that fall on a month/quarter start. (Judging by its first and last entries, the output below was produced with a wider range, roughly January 2015 through May 2019, than the start and end in the question.)
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
holidays = USFederalHolidayCalendar().holidays(start, end, return_name=False)
month_dates = pd.bdate_range(start, end, freq='CBMS', holidays=list(holidays))
print(month_dates)
# keep only the months that open a calendar quarter
print(pd.DatetimeIndex([e[1] for e in zip(month_dates.month, month_dates) if e[0] in {1, 4, 7, 10}]))
DatetimeIndex(['2015-01-02', '2015-02-02', '2015-03-02', '2015-04-01',
'2015-05-01', '2015-06-01', '2015-07-01', '2015-08-03',
'2015-09-01', '2015-10-01', '2015-11-02', '2015-12-01',
'2016-01-04', '2016-02-01', '2016-03-01', '2016-04-01',
'2016-05-02', '2016-06-01', '2016-07-01', '2016-08-01',
'2016-09-01', '2016-10-03', '2016-11-01', '2016-12-01',
'2017-01-03', '2017-02-01', '2017-03-01', '2017-04-03',
'2017-05-01', '2017-06-01', '2017-07-03', '2017-08-01',
'2017-09-01', '2017-10-02', '2017-11-01', '2017-12-01',
'2018-01-02', '2018-02-01', '2018-03-01', '2018-04-02',
'2018-05-01', '2018-06-01', '2018-07-02', '2018-08-01',
'2018-09-04', '2018-10-01', '2018-11-01', '2018-12-03',
'2019-01-02', '2019-02-01', '2019-03-01', '2019-04-01',
'2019-05-01'],
dtype='datetime64[ns]', freq='CBMS')
DatetimeIndex(['2015-01-02', '2015-04-01', '2015-07-01', '2015-10-01',
'2016-01-04', '2016-04-01', '2016-07-01', '2016-10-03',
'2017-01-03', '2017-04-03', '2017-07-03', '2017-10-02',
'2018-01-02', '2018-04-02', '2018-07-02', '2018-10-01',
'2019-01-02', '2019-04-01'],
dtype='datetime64[ns]', freq=None)
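To keep the question's own quarter months (June, September, December, March) while still honouring holidays, one option is to roll each quarterly month start forward with a holiday-aware business day offset. This is a sketch under the assumption that USFederalHolidayCalendar is the calendar you want; the names us_bday and quarterly are only illustrative.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

start = '2015-06-01'
end = '2019-06-01'

# Business day offset that skips weekends and US federal holidays
us_bday = CustomBusinessDay(calendar=USFederalHolidayCalendar())

# Every third month start, as in the question
month_starts = pd.date_range(start, end, freq='MS')[::3]

# rollforward leaves valid business days alone and moves the rest forward
quarterly = pd.DatetimeIndex([us_bday.rollforward(d) for d in month_starts])
print(quarterly)
```

With this approach 2018-09-01 (a Saturday) moves past Labor Day on 2018-09-03 and lands on 2018-09-04.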
Related
Last trading day of each month from a given list using python
date = ['2010-01-11' '2010-01-12' '2010-01-13' '2010-01-14' '2010-01-15' '2010-01-16' '2010-01-17' '2010-01-18' '2010-01-19' '2010-01-20' '2010-01-21' '2010-01-22' '2010-01-23' '2010-01-24' '2010-01-25' '2010-01-26' '2010-01-27' '2010-01-28' '2010-01-29' '2010-01-30' '2010-01-31' '2010-02-01' '2010-02-02' '2010-02-03' '2010-02-04' '2010-02-05' '2010-02-06' '2010-02-07' '2010-02-08' '2010-02-09' '2010-02-10' '2010-02-11' '2010-02-12' '2010-02-13' '2010-02-14' '2010-02-15' '2010-02-16' '2010-02-17' '2010-02-18' '2010-02-19' '2010-02-20' '2010-02-21' '2010-02-22' '2010-02-23' '2010-02-24' '2010-02-25' '2010-02-26' '2010-02-27' '2010-02-28' '2010-03-01' '2010-03-02' '2010-03-03' '2010-03-04' '2010-03-05' '2010-03-06' '2010-03-07' '2010-03-08' '2010-03-09' '2010-03-10' '2010-03-11' '2010-03-12' '2010-03-13' '2010-03-14' '2010-03-15' '2010-03-16' '2010-03-17' '2010-03-18' '2010-03-19' '2010-03-20' '2010-03-21' '2010-03-22' '2010-03-23' '2010-03-24' '2010-03-25' '2010-03-26' '2010-03-27' '2010-03-28' '2010-03-29' '2010-03-30' '2010-03-31' '2010-04-01' '2010-04-02' '2010-04-03' '2010-04-04' '2010-04-05' '2010-04-06' '2010-04-07' '2010-04-08' '2010-04-09' '2010-04-10' '2010-04-11' '2010-04-12' '2010-04-13' '2010-04-14' '2010-04-15' '2010-04-16' '2010-04-17' '2010-04-18' '2010-04-19' '2010-04-20' '2010-04-21' '2010-04-22' '2010-04-23' '2010-04-24' '2010-04-25' '2010-04-26' '2010-04-27' '2010-04-28' '2010-04-29' '2010-04-30' '2010-05-01' '2010-05-02' '2010-05-03' '2010-05-04' '2010-05-05' '2010-05-06' '2010-05-07' '2010-05-08' '2010-05-09' '2010-05-10' '2010-05-11' '2010-05-12' '2010-05-13' '2010-05-14' '2010-05-15' '2010-05-16' '2010-05-17' '2010-05-18' '2010-05-19' '2010-05-20' '2010-05-21' '2010-05-22' '2010-05-23' '2010-05-24' '2010-05-25' '2010-05-26' '2010-05-27' '2010-05-28' '2010-05-29' '2010-05-30' '2010-05-31' '2010-06-01' '2010-06-02' '2010-06-03' '2010-06-04' '2010-06-05' '2010-06-06' '2010-06-07' '2010-06-08' '2010-06-09' '2010-06-10' '2010-06-11' '2010-06-12' 
'2010-06-13' '2010-06-14' '2010-06-15' '2010-06-16' '2010-06-17' '2010-06-18' '2010-06-19' '2010-06-20' '2010-06-21' '2010-06-22' '2010-06-23' '2010-06-24' '2010-06-25' '2010-06-26' '2010-06-27' '2010-06-28' '2010-06-29' '2010-06-30']
I can't seem to figure out the code to extract the last day of each month in the above list. Please note that the last day of each month in the above list is not necessarily the last day of the calendar month.
Expected output: ['2010-01-29', '2010-02-26', '2010-03-31', '2010-04-30', '2010-05-28', '2010-06-30']
I saw a solution like the following, but it does not return a valid result:
date = date - pd.tseries.offsets.MonthEnd()
previous_month = '01'
last_trading_days = []
for index, day in enumerate(date):
    # Extract month from date
    month = day[5:7]
    # If this is the first day of the new month, append the day that came before it
    if month != previous_month:
        previous_month = month
        last_trading_days.append(date[index - 1])
    # Also append the last day
    if index == len(date) - 1:
        last_trading_days.append(day)
print(last_trading_days)
This works if you know the first month will be January; otherwise you can use previous_month = date[0][5:7] to start on the month of the first date in the list.
Download stock data and keep the last trading day of each month to date:
import yfinance as yf
# symbol is the ticker string, defined by the caller
df = yf.download(symbol, period='max')
df = df.groupby(df.index.strftime('%Y-%m')).tail(1)
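The expected output in the question skips weekends and also 2010-05-31 (Memorial Day), so the list seems to need a trading-day filter before taking the last entry per month. A minimal sketch, assuming US federal holidays are a good enough stand-in for the exchange calendar (they are not identical) and rebuilding the question's list with pd.date_range:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Stand-in for the question's list: every calendar day in the range, as strings
date = pd.date_range('2010-01-11', '2010-06-30', freq='D').strftime('%Y-%m-%d').tolist()

s = pd.Series(pd.to_datetime(date))
holidays = USFederalHolidayCalendar().holidays(s.min(), s.max())

# Keep weekdays that are not federal holidays
trading = s[(s.dt.dayofweek < 5) & ~s.isin(holidays)]

# Last remaining date in each calendar month
last_days = trading.groupby(trading.dt.to_period('M')).max()
last_trading_days = last_days.dt.strftime('%Y-%m-%d').tolist()
print(last_trading_days)
```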
python3: Split time series by diurnal periods
I have the following dataset: 01/05/2020,00,26.3,27.5,26.3,80,81,73,22.5,22.7,22.0,993.7,993.7,993.0,0.0,178,1.2,-3.53,0.0 01/05/2020,01,26.1,26.8,26.1,79,80,75,22.2,22.4,21.9,994.4,994.4,993.7,1.1,22,2.0,-3.54,0.0 01/05/2020,02,25.4,26.1,25.4,80,81,79,21.6,22.3,21.6,994.7,994.7,994.4,0.1,335,2.3,-3.54,0.0 01/05/2020,03,23.3,25.4,23.3,90,90,80,21.6,21.8,21.5,994.7,994.8,994.6,0.9,263,1.5,-3.54,0.0 01/05/2020,04,22.9,24.2,22.9,89,90,86,21.0,22.1,21.0,994.2,994.7,994.2,0.3,268,2.0,-3.54,0.0 01/05/2020,05,22.8,23.1,22.8,90,91,89,21.0,21.4,20.9,993.6,994.2,993.6,0.7,264,1.5,-3.54,0.0 01/05/2020,06,22.2,22.8,22.2,92,92,90,20.9,21.2,20.8,993.6,993.6,993.4,0.8,272,1.6,-3.54,0.0 01/05/2020,07,22.6,22.6,22.0,91,93,91,21.0,21.2,20.7,993.4,993.6,993.4,0.4,284,2.3,-3.49,0.0 01/05/2020,08,21.6,22.6,21.5,92,92,90,20.2,20.9,20.1,993.8,993.8,993.4,0.4,197,2.1,-3.54,0.0 01/05/2020,09,22.0,22.1,21.5,92,93,92,20.7,20.8,20.2,994.3,994.3,993.7,0.0,125,2.1,-3.53,0.0 01/05/2020,10,22.7,22.7,21.9,91,92,91,21.2,21.2,20.5,995.0,995.0,994.3,0.0,354,0.0,70.99,0.0 01/05/2020,11,25.0,25.0,22.7,83,91,82,21.8,22.1,21.1,995.5,995.5,995.0,0.8,262,1.5,744.8,0.0 01/05/2020,12,27.9,28.1,24.9,72,83,70,22.3,22.8,21.6,996.1,996.1,995.5,0.7,228,1.9,1392.,0.0 01/05/2020,13,30.4,30.4,27.7,58,72,55,21.1,22.6,20.4,995.9,996.2,995.9,1.6,134,3.7,1910.,0.0 01/05/2020,14,31.7,32.3,30.1,50,58,48,20.2,21.3,19.7,995.8,996.1,995.8,3.0,114,5.4,2577.,0.0 01/05/2020,15,32.9,33.2,31.8,44,50,43,19.1,20.5,18.6,994.9,995.8,994.9,0.0,128,5.6,2853.,0.0 01/05/2020,16,33.2,34.4,32.0,46,48,41,20.0,20.0,18.2,994.0,994.9,994.0,0.0,125,4.3,2700.,0.0 01/05/2020,17,33.1,34.5,32.7,44,46,39,19.2,19.9,18.5,993.4,994.1,993.4,0.0,170,1.6,2806.,0.0 01/05/2020,18,33.6,34.2,32.6,41,47,40,18.5,20.0,18.3,992.6,993.4,992.6,0.0,149,0.0,2319.,0.0 01/05/2020,19,33.5,34.7,32.1,43,49,39,19.2,20.4,18.3,992.3,992.6,992.3,0.3,168,4.1,1907.,0.0 01/05/2020,20,32.1,33.9,32.1,49,51,41,20.2,20.7,18.5,992.4,992.4,992.3,0.1,192,3.7,1203.,0.0 
01/05/2020,21,29.9,32.2,29.9,62,62,49,21.8,21.9,20.2,992.3,992.4,992.2,0.0,188,2.9,408.0,0.0 01/05/2020,22,28.5,29.9,28.4,67,67,62,21.8,22.0,21.7,992.5,992.5,992.3,0.4,181,2.3,6.817,0.0 01/05/2020,23,27.8,28.5,27.8,71,71,66,22.1,22.1,21.5,993.1,993.1,992.5,0.0,225,1.6,-3.39,0.0 02/05/2020,00,27.4,28.2,27.3,75,75,68,22.5,22.5,21.7,993.7,993.7,993.1,0.5,139,1.5,-3.54,0.0 02/05/2020,01,27.3,27.7,27.3,72,75,72,21.9,22.6,21.9,994.3,994.3,993.7,0.0,126,1.1,-3.54,0.0 02/05/2020,02,25.4,27.3,25.2,85,85,72,22.6,22.8,21.9,994.4,994.5,994.3,0.1,256,2.6,-3.54,0.0 02/05/2020,03,25.5,25.6,25.3,84,85,82,22.5,22.7,22.1,994.3,994.4,994.2,0.0,329,0.7,-3.54,0.0 02/05/2020,04,24.5,25.5,24.5,86,86,82,22.0,22.5,21.9,993.9,994.3,993.9,0.0,290,1.2,-3.54,0.0 02/05/2020,05,24.0,24.5,23.5,87,88,86,21.6,22.1,21.3,993.6,993.9,993.6,0.7,285,1.3,-3.54,0.0 02/05/2020,06,23.7,24.1,23.7,87,87,85,21.3,21.6,21.3,993.1,993.6,993.1,0.1,305,1.1,-3.51,0.0 02/05/2020,07,22.7,24.1,22.5,91,91,86,21.0,21.7,20.7,993.1,993.3,993.1,0.6,220,1.1,-3.54,0.0 02/05/2020,08,22.9,22.9,22.6,92,92,91,21.5,21.5,21.0,993.2,993.2,987.6,0.0,239,1.5,-3.53,0.0 02/05/2020,09,22.9,23.0,22.8,93,93,92,21.7,21.7,21.4,993.6,993.6,993.2,0.0,289,0.4,-3.53,0.0 02/05/2020,10,23.5,23.5,22.8,92,93,92,22.1,22.1,21.6,994.3,994.3,993.6,0.0,256,0.0,91.75,0.0 02/05/2020,11,26.1,26.2,23.5,80,92,80,22.4,23.1,22.2,995.0,995.0,994.3,1.1,141,1.9,789.0,0.0 02/05/2020,12,28.7,28.7,26.1,69,80,68,22.4,22.7,22.1,995.5,995.5,995.0,0.0,116,2.2,1468.,0.0 02/05/2020,13,31.4,31.4,28.6,56,69,56,21.6,22.9,21.0,995.5,995.7,995.4,0.0,65,0.0,1762.,0.0 02/05/2020,14,32.1,32.4,30.6,48,58,47,19.8,22.0,19.3,995.0,995.6,990.6,0.0,105,0.0,2657.,0.0 02/05/2020,15,34.0,34.2,31.7,43,48,42,19.6,20.1,18.6,993.9,995.0,993.9,3.0,71,6.0,2846.,0.0 02/05/2020,16,34.7,34.7,32.3,38,48,38,18.4,20.3,18.3,992.7,993.9,992.7,1.4,63,6.3,2959.,0.0 02/05/2020,17,34.0,34.7,32.7,42,46,38,19.2,20.0,18.4,991.7,992.7,991.7,2.2,103,4.8,2493.,0.0 
02/05/2020,18,34.3,34.7,33.6,41,42,38,19.1,19.4,18.0,991.2,991.7,991.2,2.0,141,4.8,2593.,0.0 02/05/2020,19,33.5,34.5,32.5,42,47,39,18.7,20.0,18.4,990.7,991.4,989.9,1.8,132,4.2,1317.,0.0 02/05/2020,20,32.5,34.2,32.5,47,48,40,19.7,20.3,18.7,990.5,990.7,989.8,1.3,191,4.2,1250.,0.0 02/05/2020,21,30.5,32.5,30.5,59,59,47,21.5,21.6,20.0,979.8,990.5,979.5,0.1,157,2.9,345.5,0.0 02/05/2020,22,28.6,30.5,28.6,67,67,59,21.9,21.9,21.5,978.9,980.1,978.7,0.6,166,2.2,1.122,0.0 02/05/2020,23,27.2,28.7,27.2,74,74,66,22.1,22.2,21.6,978.9,979.3,978.6,0.0,246,1.7,-3.54,0.0 03/05/2020,00,26.5,27.2,26.0,77,80,74,22.2,22.5,22.0,979.0,979.1,978.7,0.0,179,1.4,-3.54,0.0 03/05/2020,01,26.0,26.6,26.0,80,80,77,22.4,22.5,22.1,979.1,992.4,978.7,0.0,276,0.6,-3.54,0.0 03/05/2020,02,26.0,26.5,26.0,79,81,75,22.1,22.5,21.7,978.8,979.1,978.5,0.0,290,0.6,-3.53,0.0 03/05/2020,03,25.3,26.0,25.3,83,83,79,22.2,22.4,21.8,978.6,989.4,978.5,0.5,303,1.0,-3.54,0.0 03/05/2020,04,25.3,25.6,24.6,81,85,81,21.9,22.5,21.7,978.1,992.7,977.9,0.7,288,1.5,-3.00,0.0 03/05/2020,05,23.7,25.3,23.7,88,88,81,21.5,21.9,21.5,977.6,991.8,977.3,1.2,256,1.8,-3.54,0.0 03/05/2020,06,23.3,23.7,23.3,91,91,88,21.7,21.7,21.5,976.9,977.6,976.7,0.4,245,1.8,-3.54,0.0 03/05/2020,07,23.0,23.6,23.0,91,91,89,21.4,21.9,21.3,976.7,977.0,976.4,0.9,257,1.9,-3.54,0.0 03/05/2020,08,23.4,23.4,22.9,90,92,90,21.7,21.7,21.3,976.8,976.9,976.5,0.4,294,1.6,-3.52,0.0 03/05/2020,09,23.0,23.5,23.0,88,90,87,21.0,21.6,20.9,992.1,992.1,976.7,0.8,263,1.6,-3.54,0.0 03/05/2020,10,23.2,23.2,22.5,91,92,88,21.6,21.6,20.8,993.0,993.0,992.2,0.1,226,1.5,29.03,0.0 03/05/2020,11,26.0,26.1,23.2,77,91,76,21.6,22.1,21.5,993.8,993.8,982.1,0.0,120,0.9,458.1,0.0 03/05/2020,12,26.6,27.0,25.5,76,80,76,22.1,22.5,21.4,982.7,994.3,982.6,0.3,121,2.3,765.3,0.0 03/05/2020,13,28.5,28.7,26.6,66,77,65,21.5,23.1,21.2,982.5,994.2,982.4,1.4,130,3.2,1219.,0.0 03/05/2020,14,31.1,31.1,28.5,55,66,53,21.0,21.8,19.9,982.3,982.7,982.1,1.2,129,3.7,1743.,0.0 
03/05/2020,15,31.6,31.8,30.7,50,55,49,19.8,20.8,19.2,992.9,993.5,982.2,1.1,119,5.1,1958.,0.0 03/05/2020,16,32.7,32.8,31.1,46,52,46,19.6,20.7,19.2,991.9,992.9,991.9,0.8,122,4.4,1953.,0.0 03/05/2020,17,32.3,33.3,32.0,44,49,42,18.6,20.2,18.2,990.7,991.9,979.0,2.6,133,5.9,2463.,0.0 03/05/2020,18,33.1,33.3,31.9,44,50,44,19.3,20.8,18.9,989.9,990.7,989.9,1.1,170,5.4,2033.,0.0 03/05/2020,19,32.4,33.2,32.2,47,47,44,19.7,20.0,18.7,989.5,989.9,989.5,2.4,152,5.2,1581.,0.0 03/05/2020,20,31.2,32.5,31.2,53,53,46,20.6,20.7,19.4,989.5,989.7,989.5,1.7,159,4.6,968.6,0.0 03/05/2020,21,29.7,32.0,29.7,62,62,51,21.8,21.8,20.5,989.7,989.7,989.4,0.8,154,4.0,414.2,0.0 03/05/2020,22,28.3,29.7,28.3,69,69,62,22.1,22.1,21.7,989.9,989.9,989.7,0.3,174,2.0,6.459,0.0 03/05/2020,23,26.9,28.5,26.9,75,75,67,22.1,22.5,21.7,990.5,990.5,989.8,0.2,183,1.0,-3.54,0.0
The second column is the time (hour). I want to separate the dataset into morning (06-11), afternoon (12-17), evening (18-23) and night (00-05). How can I do it?
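You can use pd.cut:
bins = [-1, 5, 11, 17, 24]
labels = ['night', 'morning', 'afternoon', 'evening']
df['day_part'] = pd.cut(df['hour'], bins=bins, labels=labels)
The bins are closed on the right, so (-1, 5] covers hours 00-05 (night), (5, 11] is morning, (11, 17] is afternoon and (17, 24] is evening; the labels have to be listed in that order.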
I added column names, including Hour for the second column. Then I used read_csv to read the source text; it drops the leading zeroes, so the Hour column is just int.
To split the rows (add a column marking the diurnal period), use:
df['period'] = pd.cut(df.Hour, bins=[0, 6, 12, 18, 24], right=False,
    labels=['night', 'morning', 'afternoon', 'evening'])
Then you can, e.g., use groupby to process your groups.
Because I used the right=False parameter, the bins are closed on the left side, so the bin limits are more natural (no need for -1 as an hour). The bin limits (except for the last) are simply the starting hours of each period.
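To see how the left-closed bins assign the boundary hours, here is a small sketch on a hypothetical frame that holds only an Hour column (the values are chosen to hit both edges of every period, they are not taken from the dataset):

```python
import pandas as pd

# Hypothetical hours covering the first and last hour of each period
df = pd.DataFrame({'Hour': [0, 5, 6, 11, 12, 17, 18, 23]})

df['period'] = pd.cut(
    df['Hour'],
    bins=[0, 6, 12, 18, 24],
    right=False,  # bins are [0, 6), [6, 12), [12, 18), [18, 24)
    labels=['night', 'morning', 'afternoon', 'evening'],
)

# One sub-frame per diurnal period
groups = {name: part for name, part in df.groupby('period', observed=True)}
print(df)
```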
How to do a vector of dates in python? [duplicate]
I'm trying to generate a date range of monthly data where the day is always at the beginning of the month:
pd.date_range(start='1/1/1980', end='11/1/1991', freq='M')
This generates 1/31/1980, 2/29/1980, and so on. Instead, I just want 1/1/1980, 2/1/1980,...
I've seen other questions ask about generating data that is always on a specific day of the month, with answers saying it wasn't possible, but beginning of month surely must be possible!
You can do this by changing the freq argument from 'M' to 'MS':
import pandas
d = pandas.date_range(start='1/1/1980', end='11/1/1990', freq='MS')
print(d)
This should now print:
DatetimeIndex(['1980-01-01', '1980-02-01', '1980-03-01', '1980-04-01',
'1980-05-01', '1980-06-01', '1980-07-01', '1980-08-01',
'1980-09-01', '1980-10-01',
...
'1990-02-01', '1990-03-01', '1990-04-01', '1990-05-01',
'1990-06-01', '1990-07-01', '1990-08-01', '1990-09-01',
'1990-10-01', '1990-11-01'],
dtype='datetime64[ns]', length=131, freq='MS', tz=None)
Look into the offset aliases part of the documentation. There it states that 'M' is for the end of the month (month end frequency), while 'MS' is for the beginning (month start frequency).
It is worth noting that pandas.date_range() only includes dates within the defined interval, which may not be what you expect:
start = "2020-03-08"
end = "2021-03-08"
pd.date_range(start, end, freq='MS')
results in
DatetimeIndex(['2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01',
'2020-08-01', '2020-09-01', '2020-10-01', '2020-11-01',
'2020-12-01', '2021-01-01', '2021-02-01', '2021-03-01'],
dtype='datetime64[ns]', freq='MS')
For MS, a workaround to include the first day of the opening month is to work only with the year and month of the start date:
pd.date_range(start[:7], end, freq='MS')
will then give
DatetimeIndex(['2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01',
'2020-07-01', '2020-08-01', '2020-09-01', '2020-10-01',
'2020-11-01', '2020-12-01', '2021-01-01', '2021-02-01',
'2021-03-01'],
dtype='datetime64[ns]', freq='MS')
If you wish to keep the same starting day for each month, you can then add the offset with pd.DateOffset():
pd.date_range(start[:7], end, freq='MS') + pd.DateOffset(days=7)
will give
DatetimeIndex(['2020-03-08', '2020-04-08', '2020-05-08', '2020-06-08',
'2020-07-08', '2020-08-08', '2020-09-08', '2020-10-08',
'2020-11-08', '2020-12-08', '2021-01-08', '2021-02-08',
'2021-03-08'],
dtype='datetime64[ns]', freq=None)
As mentioned in the comments, note that this workaround can cause trouble for offsets of 28 days or more, because the shifted date can spill into the following month.
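The two workarounds above can also be written without string slicing. This is only a sketch of an alternative, not the method from the answer: MonthBegin().rollback() snaps the start to the first of its month, and a DateOffset(months=1) frequency steps one calendar month at a time from the original start day. For start days after the 28th the monthly step gets clamped in shorter months, so check the result in that case.

```python
import pandas as pd

start = "2020-03-08"
end = "2021-03-08"

# Roll the start back to the first day of its month (unchanged if already there)
first_of_month = pd.offsets.MonthBegin().rollback(pd.Timestamp(start))
month_starts = pd.date_range(first_of_month, end, freq="MS")

# Step one calendar month at a time, keeping the start's day of month
same_day = pd.date_range(start, end, freq=pd.DateOffset(months=1))

print(month_starts)
print(same_day)
```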
To check if the continuity of dates are missing in a column
I want to check a column of my dataframe: if a date is missing for a certain month, the code should output that month in the format MMM-YYYY.
The data set looks like this:
date_start_balance date_end_balance start_balance
22.02.16 22.03.16 3590838
22.04.16 22.05.16 69788
15.06.16 21.07.16 452165
Both date columns are in datetime format. In the above data set, the dates are missing for March and May in the start column, and these should be returned as MMM-YYYY.
I have tried the following code:
import datetime
dates = df1['date_start_balance'].tolist()
missing = []
for i in range(0,len(dates)-1):
    if dates[i+1].month - dates[i+1].month != 1:
        for j in range(dates[i].month+1,dates[i+1].month):
            missing.append(datetime(dates[i].year, j,1))
print(missing)
You can first create a date range with pd.date_range:
import datetime
import pandas as pd
may = pd.date_range(start='2016-05-01', end='2016-05-31')
Then build the list of dates that you already have; in this example there is only one date, 2016-05-15:
your_list = [datetime.datetime.strptime('15052016', "%d%m%Y")]
Then you can calculate the difference between the range and your list to get the dates that are missing:
may.difference(your_list)
DatetimeIndex(['2016-05-01', '2016-05-02', '2016-05-03', '2016-05-04',
'2016-05-05', '2016-05-06', '2016-05-07', '2016-05-08',
'2016-05-09', '2016-05-10', '2016-05-11', '2016-05-12',
'2016-05-13', '2016-05-14', '2016-05-16', '2016-05-17',
'2016-05-18', '2016-05-19', '2016-05-20', '2016-05-21',
'2016-05-22', '2016-05-23', '2016-05-24', '2016-05-25',
'2016-05-26', '2016-05-27', '2016-05-28', '2016-05-29',
'2016-05-30', '2016-05-31'],
dtype='datetime64[ns]', freq=None)
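The same difference idea can be applied at month granularity, which is what the question asks for. A minimal sketch, assuming the column is already datetime and rebuilding the three sample rows from the question; the names present and full are only illustrative, and %b yields English month abbreviations under the default locale:

```python
import pandas as pd

# The start dates from the question, parsed as day.month.year
df1 = pd.DataFrame({
    "date_start_balance": pd.to_datetime(
        ["22.02.16", "22.04.16", "15.06.16"], format="%d.%m.%y"
    ),
})

# Months that actually occur in the column
present = set(df1["date_start_balance"].dt.to_period("M"))

# Every month between the first and last one seen
full = pd.period_range(min(present), max(present), freq="M")

missing = [p.strftime("%b-%Y") for p in full if p not in present]
print(missing)
```

Working with monthly periods also handles ranges that cross a year boundary, which comparing .month values alone does not.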
Remove certain dates in list. Python 3.4
I have a list that has several days in it. Each day has several timestamps. What I want to do is make a new list that only takes the start time and the end time in the list for each date. I also want to delete the character between the date and the time on each one; the character is always the same letter. The number of timestamps can vary for each date.
Since I'm new to Python, simple, easy-to-understand code would be preferred. I've been using a lot of regex, so please say if there is a way with that. The list has been sorted with list.sort(), so it's in the correct order.
The code used to extract the information was the following:
file1 = open("test.txt", "r")
for f in file1:
    list1 += re.findall('20\d\d-\d\d-\d\dA\d\d\:\d\d', f)
listX = (len(list1))
list2 = list1[0:listX - 2]
list2.sort()
Here is how the list looks:
2015-12-28A09:30
2015-12-28A09:30
2015-12-28A09:35
2015-12-28A09:35
2015-12-28A12:00
2015-12-28A12:00
2015-12-28A12:15
2015-12-28A12:15
2015-12-28A14:30
2015-12-28A14:30
2015-12-28A15:15
2015-12-28A15:15
2015-12-28A16:45
2015-12-28A16:45
2015-12-28A17:00
2015-12-28A17:00
2015-12-28A18:15
2015-12-28A18:15
2015-12-29A08:30
2015-12-29A08:30
2015-12-29A08:35
2015-12-29A08:35
2015-12-29A10:45
2015-12-29A10:45
2015-12-29A11:00
2015-12-29A11:00
2015-12-29A13:15
2015-12-29A13:15
2015-12-29A14:00
2015-12-29A14:00
2015-12-29A15:30
2015-12-29A15:30
2015-12-29A15:45
2015-12-29A15:45
2015-12-29A17:15
2015-12-29A17:15
2015-12-30A08:30
2015-12-30A08:30
2015-12-30A08:35
2015-12-30A08:35
2015-12-30A10:45
2015-12-30A10:45
2015-12-30A11:00
2015-12-30A11:00
2015-12-30A13:00
2015-12-30A13:00
2015-12-30A13:45
2015-12-30A13:45
2015-12-30A15:15
2015-12-30A15:15
2015-12-30A15:30
2015-12-30A15:30
2015-12-30A17:15
2015-12-30A17:15
And this is how I want it to look:
2015-12-28 09:30
2015-12-28 18:15
2015-12-29 08:30
2015-12-29 17:15
2015-12-30 08:30
2015-12-30 17:15
First of all, you should convert all your strings into proper dates that Python can work with. That way, you have a lot more control over them, including changing the formatting later. So let’s parse your dates in list2 using datetime.strptime:
from datetime import datetime
dates = [datetime.strptime(item, '%Y-%m-%dA%H:%M') for item in list2]
This creates a new list dates that contains all your dates from list2, but as parsed datetime objects.
Now, since you want to get the first and the last date of each day, we somehow have to group your dates by the date component. There are various ways to do that. I’ll be using itertools.groupby for it, with a key function that just looks at the date component of each entry:
from itertools import groupby
for day, times in groupby(dates, lambda x: x.date()):
    first, *mid, last = times
    print(first)
    print(last)
If we run this, we already get your output (without date formatting):
2015-12-28 09:30:00
2015-12-28 18:15:00
2015-12-29 08:30:00
2015-12-29 17:15:00
2015-12-30 08:30:00
2015-12-30 17:15:00
Of course, you can also collect that first and last date in a list first to process the dates later:
filteredDates = []
for day, times in groupby(dates, lambda x: x.date()):
    first, *mid, last = times
    filteredDates.append(first)
    filteredDates.append(last)
And you can also output your dates with a different format using datetime.strftime:
for date in filteredDates:
    print(date.strftime('%Y-%m-%d %H:%M'))
That would give us the following output:
2015-12-28 09:30
2015-12-28 18:15
2015-12-29 08:30
2015-12-29 17:15
2015-12-30 08:30
2015-12-30 17:15
If you don’t want to go the route of parsing those dates, you could of course also do this simply by working on the strings. Since they are nicely formatted (i.e. they can be easily compared), that works as well. It would look like this then:
for day, times in groupby(list2, lambda x: x[:10]):
    first, *mid, last = times
    print(first)
    print(last)
Producing the following output:
2015-12-28A09:30
2015-12-28A18:15
2015-12-29A08:30
2015-12-29A17:15
2015-12-30A08:30
2015-12-30A17:15
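One caveat with first, *mid, last = times is that it raises a ValueError when a day has only a single timestamp. A small sketch of the string-based variant that tolerates such days and also swaps the separator letter for a space; the shortened list2 below is a hypothetical sample in the same format, not the full list from the question:

```python
from itertools import groupby

# Shortened, hypothetical sample in the same format as list2 (already sorted)
list2 = [
    "2015-12-28A09:30", "2015-12-28A12:00", "2015-12-28A18:15",
    "2015-12-29A08:30", "2015-12-29A17:15",
    "2015-12-30A08:30",
]

result = []
for day, times in groupby(list2, key=lambda stamp: stamp[:10]):
    times = list(times)
    # times[0] and times[-1] are the same entry when a day has one timestamp
    for stamp in (times[0], times[-1]):
        # Characters 0-9 are the date, character 10 is the separator letter
        result.append(stamp[:10] + " " + stamp[11:])

print(result)
```

A day with a single timestamp shows up twice here (as both start and end); drop the duplicate if that is not what you want.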
Because your data is ordered, you just need to pull the first and last value from each group. You can use str.replace to swap the single letter for a space, then split each date string and compare just the dates:
def grp(l):
    it = iter(l)
    prev = start = next(it).replace("A", " ")
    for dte in it:
        dte = dte.replace("A", " ")
        # if we have a new date, yield that start and end
        if dte.split(None, 1)[0] != prev.split(None, 1)[0]:
            yield start
            yield prev
            start = dte
        prev = dte
    # yield the start and end of the final date
    yield start
    yield prev
l=["2015-12-28A09:30", "2015-12-28A09:30", .....................
l[:] = grp(l)
This could also certainly be done as you process the file, without sorting, by using a dict to group:
from re import findall
from collections import defaultdict
with open("dates.txt") as f:
    # "null" sorts after any timestamp starting with a digit, so the first value seen always replaces it
    od = defaultdict(lambda: {"min": "null", "max": ""})
    for line in f:
        for dte in findall(r'20\d\d-\d\d-\d\dA\d\d:\d\d', line):
            dte, tme = dte.split("A")
            _dte = "{} {}".format(dte, tme)
            if od[dte]["min"] > _dte:
                od[dte]["min"] = _dte
            if od[dte]["max"] < _dte:
                od[dte]["max"] = _dte
print(list(od.values()))
Which will give you the start and end time for each date:
[{'min': '2016-01-03 23:59', 'max': '2016-01-03 23:59'}, {'min': '2015-12-28 00:00', 'max': '2015-12-28 18:15'}, {'min': '2015-12-30 08:30', 'max': '2015-12-30 17:15'}, {'min': '2015-12-29 08:30', 'max': '2015-12-29 17:15'}, {'min': '2015-12-15 08:41', 'max': '2015-12-15 08:41'}]
The start for 2015-12-28 is also 00:00, not 9:30.
If your dates are actually as posted, one per line, you don't need a regex either:
from collections import defaultdict
with open("dates.txt") as f:
    od = defaultdict(lambda: {"min": "null", "max": ""})
    for line in f:
        dte, tme = line.rstrip().split("A")
        _dte = "{} {}".format(dte, tme)
        if od[dte]["min"] > _dte:
            od[dte]["min"] = _dte
        if od[dte]["max"] < _dte:
            od[dte]["max"] = _dte
print(list(od.values()))
Which would give you the same output.