I am trying to modify the overlapping time period problem so that if there is 1 day difference between dates, it should still be counted as an overlap. As long as the difference in dates is less than 2 days it should be seen as an overlap.
This is the dataframe containing the dates
df_dates = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103, 104, 104, 104, 102, 104, 104, 103, 106, 106, 106],
"start dates": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3), pd.Timestamp(2002,10,20), pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9), pd.Timestamp(2005, 2, 8), pd.Timestamp(1993, 1, 1), pd.Timestamp(2005, 2, 3), pd.Timestamp(2005, 2, 16), pd.Timestamp(2002, 11, 16), pd.Timestamp(2005, 2, 23), pd.Timestamp(2005, 10, 11), pd.Timestamp(2015, 2, 9), pd.Timestamp(2011, 11, 24), pd.Timestamp(2011, 11, 24), pd.Timestamp(2011, 12, 21)],
"end dates": [pd.Timestamp(2002, 1, 3), pd.Timestamp(2002, 12, 3),pd.Timestamp(2002,11,20), pd.Timestamp(2003, 4, 4), pd.Timestamp(2004, 11, 1), pd.Timestamp(2015, 2, 8), pd.Timestamp(2005, 2, 3), pd.Timestamp(2005, 2, 15) , pd.Timestamp(2005, 2, 21), pd.Timestamp(2003, 2, 16), pd.Timestamp(2005, 10, 8), pd.Timestamp(2005, 10, 21), pd.Timestamp(2015, 2, 17), pd.Timestamp(2011, 12, 31), pd.Timestamp(2011, 11, 25), pd.Timestamp(2011, 12, 22)]
})
This was helpful with answering the overlap question but I am not sure how to modify it (red circle) to include 1 day difference
This was my attempt at answering the question, which kind of did (red circle), but then the overlap calculation is not always right (yellow circle)
from datetime import timedelta

def Dates_Restructure(df, pers_id, start_dates, end_dates):
    df.sort_values([pers_id, start_dates], inplace=True)
    # True where a row starts more than 1 day after the previous row's end
    df['overlap'] = (df.groupby(pers_id)
                       .apply(lambda x: (x[end_dates].shift() - x[start_dates]) < timedelta(days=-1))
                       .reset_index(level=0, drop=True))
    df['cumsum'] = df.groupby(pers_id)['overlap'].cumsum()
    return df.groupby([pers_id, 'cumsum']).aggregate({start_dates: min, end_dates: max}).reset_index()
I will appreciate your help with this. Thanks
This was the answer I came up with and it worked. I combined the 2 solutions in my question to get this solution.
def Dates_Restructure(df_dates, pers_id, start_dates, end_dates):
    df2 = df_dates.copy()
    startdf2 = pd.DataFrame({pers_id: df2[pers_id], 'time': df2[start_dates], 'start_end': 1})
    enddf2 = pd.DataFrame({pers_id: df2[pers_id], 'time': df2[end_dates], 'start_end': -1})
    mergedf2 = pd.concat([startdf2, enddf2]).sort_values([pers_id, 'time'])
    mergedf2['cumsum'] = mergedf2.groupby(pers_id)['start_end'].cumsum()
    mergedf2['new_start'] = mergedf2['cumsum'].eq(1) & mergedf2['start_end'].eq(1)
    mergedf2['group'] = mergedf2.groupby(pers_id)['new_start'].cumsum()
    df2['group_id'] = mergedf2['group'].loc[mergedf2['start_end'].eq(1)]
    df3 = df2.groupby([pers_id, 'group_id']).aggregate({start_dates: min, end_dates: max}).reset_index()
    df3.sort_values([pers_id, start_dates], inplace=True)
    df3['overlap'] = (df3.groupby(pers_id).apply(lambda x: (x[end_dates].shift() - x[start_dates]) < timedelta(days=-1))
                      .reset_index(level=0, drop=True))
    df3['GROUP_ID'] = df3.groupby(pers_id)['overlap'].cumsum()
    return df3.groupby([pers_id, 'GROUP_ID']).aggregate({start_dates: min, end_dates: max}).reset_index()
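For comparison, here is a minimal plain-Python sketch of the same merging rule, without pandas. It assumes that "overlap" means the next start is at most 1 day after the latest end seen so far. The names merge_intervals, max_gap and periods_104 are made up for illustration; the sample rows are the ones for id 104 in the dataframe above. Tracking the latest end, rather than only the previous row's end, is what avoids the miscalculation described in the question.

```python
from datetime import date, timedelta

def merge_intervals(intervals, max_gap=timedelta(days=1)):
    """Merge (start, end) pairs whose gap to the running block is at most max_gap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the current block; keep the latest end seen so far.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Gap is larger than max_gap: start a new block.
            merged.append([start, end])
    return [tuple(pair) for pair in merged]

# Rows for id 104 from the question's dataframe
periods_104 = [
    (date(1993, 1, 1), date(2005, 2, 3)),
    (date(2005, 2, 3), date(2005, 2, 15)),
    (date(2005, 2, 16), date(2005, 2, 21)),
    (date(2005, 2, 23), date(2005, 10, 8)),
    (date(2005, 10, 11), date(2005, 10, 21)),
]
result = merge_intervals(periods_104)
print(result)
```

The first three rows collapse into one period (1993-01-01 to 2005-02-21), while the last two stay separate because their gaps are 2 and 3 days.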
I am not sure how to construct a datetime object given year, month, week_of_month and day_of_week. Any clues? With this I am trying to achieve the following:
From (start_month, start_year) to (end_month, end_year), find the monthly dates specified by the week_of_month and day_of_week parameters. Here 1 <= week_of_month <= 5 and 1 <= day_of_week <= 7. However:
A month may not have 5 weeks (e.g. February in a non-leap year).
The 1st and 5th weeks may not have 7 days.
In such cases the boolean is_to_next_day decides: if True, use the next calendar day; if False, skip that month.
Sample input/outputs:
Input parameters: start_month=1 start_year=2020, end_month=12, end_year=2020, week_of_month=5, day_of_week=3, is_to_next_day=True
Desired output: [datetime(2020, 1, 29), datetime(2020, 2, 26), datetime(2020, 3, 25), datetime(2020, 4, 29), datetime(2020, 5, 27), datetime(2020, 7, 1), datetime(2020, 7, 29), datetime(2020, 8, 26), datetime(2020, 9, 30), datetime(2020, 10, 28), datetime(2020, 11, 25), datetime(2020, 12, 30)]
Input parameters: start_month=1 start_year=2020, end_month=12, end_year=2020, week_of_month=5, day_of_week=3, is_to_next_day=False
Desired output: [datetime(2020, 1, 29), datetime(2020, 2, 26), datetime(2020, 3, 25), datetime(2020, 4, 29), datetime(2020, 5, 27), datetime(2020, 7, 29), datetime(2020, 8, 26), datetime(2020, 9, 30), datetime(2020, 10, 28), datetime(2020, 11, 25), datetime(2020, 12, 30)]
import calendar
from datetime import datetime

def get_date(year, month, week_of_month, day_of_week, is_to_next_day):
    mnth = calendar.monthcalendar(year, month)
    if (week_of_month > 1) and (week_of_month < 5):
        day = mnth[week_of_month - 1][day_of_week - 1]
        return datetime(year, month, day)
    elif week_of_month == 1:
        last_day_of_first_week = mnth[0][6]
        if day_of_week <= last_day_of_first_week:
            return datetime(year, month, day_of_week)
        elif is_to_next_day:
            return datetime(year, month, mnth[1][0])
        else:
            return None
    else:
        if (len(mnth) >= week_of_month):
            day = mnth[week_of_month - 1][day_of_week - 1]
            if (day == 0) and is_to_next_day:
                # Roll over to the 1st of the next month (December -> January of the next year)
                return datetime(year + month // 12, month % 12 + 1, 1)
            elif (day == 0):
                return None
            else:
                return datetime(year, month, day)
        if (len(mnth) < week_of_month):
            if is_to_next_day:
                return datetime(year + month // 12, month % 12 + 1, 1)
            else:
                return None

# First output
[get_date(yy, mm, 5, 3, True) for mm in range(1, 13) for yy in [2020]]
# Second output
[get_date(yy, mm, 5, 3, False) for mm in range(1, 13) for yy in [2020]] # Iterate again to drop None.
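The zero check in the code above relies on how calendar.monthcalendar pads its result: each row is one Monday-to-Sunday week, and days that fall outside the month are stored as 0. A small sketch for June 2020, the month that is skipped or rolled over in the sample outputs (the variable names are only for illustration):

```python
import calendar

# One row per week (Monday first); days outside the month are 0.
weeks = calendar.monthcalendar(2020, 6)

fifth_week = weeks[4]          # week_of_month = 5
wednesday = fifth_week[3 - 1]  # day_of_week = 3

print(fifth_week)  # [29, 30, 0, 0, 0, 0, 0]
print(wednesday)   # 0, so June 2020 has no Wednesday in its 5th week
```

Because the entry is 0, is_to_next_day=True moves the date to 1 July 2020, and is_to_next_day=False yields None for that month.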
I need to multiply the numbers in the tuples, not element by element as i[0] * j[0] and i[1] * j[1], but in every combination: i[0] * j[0], i[0] * j[1], i[0] * j[2] and so on.
I also need to add the numbers in the same way: i[0] + j[0], i[0] + j[1], i[0] + j[2] and so on.
Is there an easy way to do this, instead of my code below, which needs a lot of for loops?
dice1 = (1, 2, 3, 4)
dice2 = (1, 2, 3, 4, 5, 6, 7, 8)
dice3 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
dice4 = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
dice5 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
dice6 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
myList = []
comp = []
d = 0
e = 0
for i in dice1:
    for j in dice2:
        d = i * j
        myList.append(d)
e = len(myList)
comp.append(e)
You can utilize the itertools product function as follows:
from itertools import product
dice1 = (1, 2, 3, 4)
dice2 = (1, 2, 3, 4, 5, 6, 7, 8)
dice3 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
dice4 = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
dice5 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
dice6 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
myList = []
comp = []
myList = [k[0] * k[1] for k in product(dice1, dice2)]
comp.append(len(myList))
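The same product call also covers the addition asked about in the question. A short sketch, using only the first two dice; the names products and sums are chosen here for illustration:

```python
from itertools import product

dice1 = (1, 2, 3, 4)
dice2 = (1, 2, 3, 4, 5, 6, 7, 8)

# product() yields every (a, b) pair, so no nested for loops are needed.
products = [a * b for a, b in product(dice1, dice2)]
sums = [a + b for a, b in product(dice1, dice2)]

print(len(products))  # 32 combinations (4 * 8)
print(sums[:8])       # [2, 3, 4, 5, 6, 7, 8, 9]
```

product accepts any number of iterables, so product(dice1, dice2, dice3) would yield triples in the same way.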
How do I get the dates of the last 20 days, up to the current date, using date and datetime?
For example, CurrentDate = 2020-11-02.
I can easily get the previous date.
Here is the code:
from datetime import date, timedelta
today = date.today()
yesterday = today - timedelta(days = 1)
print(today)
print(yesterday)
But how do I get the dates of the past 20 days in Python?
My expected output looks like this:
Dateslist= ['2020-10-13','2020-10-14','2020-10-15','2020-10-16','2020-10-17','2020-10-18','2020-10-19',
...., '2020-11-02']
Any help would be appreciated. Thanks in advance.
You can do this with a list comprehension:
today = date.today()
Dateslist = [today - timedelta(days = day) for day in range(20)]
This returns date objects, in case you need to use them elsewhere in the code. If you want strings like the expected output, just wrap each one in str():
Dateslist = [str(today - timedelta(days = day)) for day in range(20)]
If you need more advanced formatting of the string, date.strftime() is worth checking.
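Note that the comprehension above starts at today and counts backwards, while the expected output in the question is in ascending order. A small sketch that produces ascending ISO strings, assuming the list should contain exactly n dates and end on the given day (last_n_days is a hypothetical helper name):

```python
from datetime import date, timedelta

def last_n_days(end, n=20):
    """Return n ISO date strings in ascending order, ending on `end`."""
    return [(end - timedelta(days=offset)).isoformat()
            for offset in range(n - 1, -1, -1)]

dates = last_n_days(date(2020, 11, 2))
print(dates[0], dates[-1], len(dates))  # 2020-10-14 2020-11-02 20
```

Passing date.today() instead of a fixed date gives the list relative to the current day.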
Try this:
from datetime import date, timedelta
today = date.today()
lst = []
for x in range(20):
    lst.append(today - timedelta(days = x+1))
print(today)
print(lst)
Output:
[datetime.date(2020, 11, 1), datetime.date(2020, 10, 31), datetime.date(2020, 10, 30), datetime.date(2020, 10, 29), datetime.date(2020, 10, 28), datetime.date(2020, 10, 27), datetime.date(2020, 10, 26), datetime.date(2020, 10, 25), datetime.date(2020, 10, 24), datetime.date(2020, 10, 23), datetime.date(2020, 10, 22), datetime.date(2020, 10, 21), datetime.date(2020, 10, 20), datetime.date(2020, 10, 19), datetime.date(2020, 10, 18), datetime.date(2020, 10, 17), datetime.date(2020, 10, 16), datetime.date(2020, 10, 15), datetime.date(2020, 10, 14), datetime.date(2020, 10, 13)]
I have an RDD with tuples like (datetime, integer),
and I am trying to get another RDD of sums over some interval with pyspark.
For example, from the following:
(2015-09-30 10:00:01, 3)
(2015-09-30 10:00:02, 1)
(2015-09-30 10:00:05, 2)
(2015-09-30 10:00:06, 7)
(2015-09-30 10:00:07, 3)
(2015-09-30 10:00:10, 5)
I'm trying to get the following sums over every 3 seconds:
(2015-09-30 10:00:01, 4) # sum of 1, 2, 3 seconds
(2015-09-30 10:00:02, 1) # sum of 2, 3, 4 seconds
(2015-09-30 10:00:05, 12) # sum of 5, 6, 7 seconds
(2015-09-30 10:00:06, 10) # sum of 6, 7, 8 seconds
(2015-09-30 10:00:07, 3) # sum of 7, 8, 9 seconds
(2015-09-30 10:00:10, 5) # sum of 10, 11, 12 seconds
Please, could you give me any hints?
I will assume that your input is an RDD time_rdd with tuples where the first element is a datetime object and the second element is an integer. You could use a flatMap to map every record to the three window start times whose 3-second window contains it (its own second and the two preceding seconds), and then use a reduceByKey to get the total count for each window.
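from datetime import timedelta

def map_to_3_seconds(datetime_obj, count):
    # Emit the count for each window start (t-2, t-1, t) whose 3-second window contains t
    list_times = []
    for i in range(-2, 1):
        list_times.append((datetime_obj + timedelta(seconds=i), count))
    return list_times

# Tuple-unpacking lambdas were removed in Python 3, so unpack the pair explicitly
output_rdd = time_rdd.flatMap(lambda pair: map_to_3_seconds(*pair)).reduceByKey(lambda x, y: x + y)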
This RDD will contain more datetime objects than the original RDD, so if you only want the original times, you need to join it with time_rdd:
result = output_rdd.join(time_rdd).map(lambda kv: (kv[0], kv[1][0])).collect()
Now result will contain:
[(datetime.datetime(2015, 9, 30, 10, 0, 5), 12),
(datetime.datetime(2015, 9, 30, 10, 0, 2), 1),
(datetime.datetime(2015, 9, 30, 10, 0, 10), 5),
(datetime.datetime(2015, 9, 30, 10, 0, 1), 4),
(datetime.datetime(2015, 9, 30, 10, 0, 6), 10),
(datetime.datetime(2015, 9, 30, 10, 0, 7), 3)]
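To check the windowing logic without a Spark cluster, the same steps can be emulated in plain Python. This is only a sketch of the idea, not PySpark code: the inner loop plays the role of flatMap, the dictionary accumulation stands in for reduceByKey, and the final filter stands in for the join. The names records, window_sums and original_times are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

records = [
    (datetime(2015, 9, 30, 10, 0, 1), 3),
    (datetime(2015, 9, 30, 10, 0, 2), 1),
    (datetime(2015, 9, 30, 10, 0, 5), 2),
    (datetime(2015, 9, 30, 10, 0, 6), 7),
    (datetime(2015, 9, 30, 10, 0, 7), 3),
    (datetime(2015, 9, 30, 10, 0, 10), 5),
]

# "flatMap" + "reduceByKey": add each count to the windows starting at t-2, t-1 and t.
window_sums = defaultdict(int)
for ts, count in records:
    for offset in range(-2, 1):
        window_sums[ts + timedelta(seconds=offset)] += count

# "join": keep only window starts that appear in the original data.
original_times = {ts for ts, _ in records}
result = sorted((ts, total) for ts, total in window_sums.items() if ts in original_times)

for ts, total in result:
    print(ts, total)
```

The totals match the expected output in the question: 4, 1, 12, 10, 3 and 5.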