Modifying overlapping time period to include 1 day difference - python-3.x

I am trying to modify the overlapping time period problem so that if there is 1 day difference between dates, it should still be counted as an overlap. As long as the difference in dates is less than 2 days it should be seen as an overlap.
This is the dataframe containing the dates
df_dates = pd.DataFrame({"id": [102, 102, 102, 102, 103, 103, 104, 104, 104, 102, 104, 104, 103, 106, 106, 106],
"start dates": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3), pd.Timestamp(2002,10,20), pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9), pd.Timestamp(2005, 2, 8), pd.Timestamp(1993, 1, 1), pd.Timestamp(2005, 2, 3), pd.Timestamp(2005, 2, 16), pd.Timestamp(2002, 11, 16), pd.Timestamp(2005, 2, 23), pd.Timestamp(2005, 10, 11), pd.Timestamp(2015, 2, 9), pd.Timestamp(2011, 11, 24), pd.Timestamp(2011, 11, 24), pd.Timestamp(2011, 12, 21)],
"end dates": [pd.Timestamp(2002, 1, 3), pd.Timestamp(2002, 12, 3),pd.Timestamp(2002,11,20), pd.Timestamp(2003, 4, 4), pd.Timestamp(2004, 11, 1), pd.Timestamp(2015, 2, 8), pd.Timestamp(2005, 2, 3), pd.Timestamp(2005, 2, 15) , pd.Timestamp(2005, 2, 21), pd.Timestamp(2003, 2, 16), pd.Timestamp(2005, 10, 8), pd.Timestamp(2005, 10, 21), pd.Timestamp(2015, 2, 17), pd.Timestamp(2011, 12, 31), pd.Timestamp(2011, 11, 25), pd.Timestamp(2011, 12, 22)]
})
This was helpful with answering the overlap question but I am not sure how to modify it (red circle) to include 1 day difference
This was my attempt at answering the question, which kind of did (red circle), but then the overlap calculation is not always right (yellow circle)
def Dates_Restructure(df, pers_id, start_dates, end_dates):
    df.sort_values([pers_id, start_dates], inplace=True)
    df['overlap'] = (df.groupby(pers_id)
                       .apply(lambda x: (x[end_dates].shift() - x[start_dates]) < timedelta(days=-1))
                       .reset_index(level=0, drop=True))
    df['cumsum'] = df.groupby(pers_id)['overlap'].cumsum()
    return df.groupby([pers_id, 'cumsum']).aggregate({start_dates: min, end_dates: max}).reset_index()
I will appreciate your help with this. Thanks

This was the answer I came up with and it worked. I combined the 2 solutions in my question to get this solution.
def Dates_Restructure(df_dates, pers_id, start_dates, end_dates):
    df2 = df_dates.copy()
    startdf2 = pd.DataFrame({pers_id: df2[pers_id], 'time': df2[start_dates], 'start_end': 1})
    enddf2 = pd.DataFrame({pers_id: df2[pers_id], 'time': df2[end_dates], 'start_end': -1})
    mergedf2 = pd.concat([startdf2, enddf2]).sort_values([pers_id, 'time'])
    mergedf2['cumsum'] = mergedf2.groupby(pers_id)['start_end'].cumsum()
    mergedf2['new_start'] = mergedf2['cumsum'].eq(1) & mergedf2['start_end'].eq(1)
    mergedf2['group'] = mergedf2.groupby(pers_id)['new_start'].cumsum()
    df2['group_id'] = mergedf2['group'].loc[mergedf2['start_end'].eq(1)]
    df3 = df2.groupby([pers_id, 'group_id']).aggregate({start_dates: min, end_dates: max}).reset_index()
    df3.sort_values([pers_id, start_dates], inplace=True)
    df3['overlap'] = (df3.groupby(pers_id).apply(lambda x: (x[end_dates].shift() - x[start_dates]) < timedelta(days=-1))
                         .reset_index(level=0, drop=True))
    df3['GROUP_ID'] = df3.groupby(pers_id)['overlap'].cumsum()
    return df3.groupby([pers_id, 'GROUP_ID']).aggregate({start_dates: min, end_dates: max}).reset_index()
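One likely reason the first attempt in the question sometimes miscalculates is that it compares each start date only with the end date of the row directly before it, so a long period that contains several shorter ones is not tracked. A possible single-pass variant, offered as a sketch rather than a drop-in replacement, compares each start with the latest end date seen so far. The function name merge_periods, the tolerance parameter, the block column, and the sample data are all assumptions made for illustration.

```python
import pandas as pd


def merge_periods(df, id_col, start_col, end_col, tolerance=pd.Timedelta(days=1)):
    """Merge periods per id when the gap between them is at most `tolerance`."""
    df = df.sort_values([id_col, start_col]).reset_index(drop=True)
    # Latest end date seen in earlier rows of the same id (NaT for the first row).
    prev_max_end = df.groupby(id_col)[end_col].transform(lambda s: s.cummax().shift())
    # A new block starts on the first row of an id, or when the gap exceeds the tolerance.
    new_block = prev_max_end.isna() | ((df[start_col] - prev_max_end) > tolerance)
    df["block"] = new_block.astype(int).groupby(df[id_col]).cumsum()
    return (df.groupby([id_col, "block"], as_index=False)
              .agg({start_col: "min", end_col: "max"}))


# Made-up sample: id 1 has a long period containing shorter ones,
# a period starting 1 day after it ends, and one starting 2 days later.
sample = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 2],
    "start": pd.to_datetime(["2002-01-01", "2002-02-01", "2002-06-01",
                             "2003-01-01", "2003-01-07", "2002-01-01"]),
    "end": pd.to_datetime(["2002-12-31", "2002-02-10", "2002-06-05",
                           "2003-01-05", "2003-01-08", "2002-01-02"]),
})
merged = merge_periods(sample, "id", "start", "end")
print(merged)
```

With the question's data it could be called as merge_periods(df_dates, "id", "start dates", "end dates").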

Related

How to sort numpy.recarray based on datetime field

I have built a recarray with np.rec.fromarrays and following structure :
dtype=np.dtype([('layers', 'U256'), ('hours', datetime.datetime), ('points', 'U256')]))
I get an object like this :
[('image1.jpg', datetime.datetime(1900, 1, 1, 21, 20), 'mypoints.points')
('image2.jpg', datetime.datetime(1900, 1, 1, 21, 15), 'mypoints.points')]
with recarray type. I want to sort my recarray based on the second column containing datetime. I tried numpy.recarray.sort but it returns a NoneType object. I use it like this :
mytable.sort(order='hours')
I also tried passing kind='quicksort' to the function, but I don't understand what it is for.
I tried to reproduce your data:
x1=np.array(['image1.jpg', 'image2.jpg'])
x2=np.array([datetime.datetime(1900, 1, 1, 21, 20), datetime.datetime(1900, 1, 1, 21, 15)])
x3=np.array(['mypoints.points', 'mypoints.points'])
array = np.rec.fromarrays([x1, x2, x3], dtype=np.dtype([('layers', 'U256'), ('hours', datetime.datetime), ('points', 'U256')]))
Output:
rec.array([('image1.jpg', datetime.datetime(1900, 1, 1, 21, 20), 'mypoints.points'),
('image2.jpg', datetime.datetime(1900, 1, 1, 21, 15), 'mypoints.points')],
dtype=[('layers', '<U256'), ('hours', 'O'), ('points', '<U256')])
But I was not able to reproduce the error: array.sort(order='hours') works fine. Note that sort works in place and returns None, so the sorted data is in array itself:
rec.array([('image2.jpg', datetime.datetime(1900, 1, 1, 21, 15), 'mypoints.points'),
('image1.jpg', datetime.datetime(1900, 1, 1, 21, 20), 'mypoints.points')],
dtype=[('layers', '<U256'), ('hours', 'O'), ('points', '<U256')])
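The difference between the in-place method and the copying function can be sketched as follows. This is an illustration under one assumption that is not in the question: the hours field is stored as datetime64 instead of Python datetime objects. The variable names are made up.

```python
import numpy as np

layers = np.array(["image1.jpg", "image2.jpg"])
# Assumption: datetime64 values instead of datetime.datetime objects.
hours = np.array(["1900-01-01T21:20", "1900-01-01T21:15"], dtype="datetime64[m]")
points = np.array(["mypoints.points", "mypoints.points"])

table = np.rec.fromarrays([layers, hours, points], names="layers,hours,points")

# np.sort returns a sorted copy and leaves `table` untouched.
sorted_copy = np.sort(table, order="hours")

# ndarray.sort reorders `table` itself and returns None.
returned = table.sort(order="hours")

print(returned)
print(table.layers)
```

So assigning the result of table.sort(...) to a variable yields None, while the table itself ends up sorted.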

Python 3: IndexError: list index out of range while doing Knapsack Problem

I am currently self-learning Python for a career change. While doing some exercises about lists, I encountered IndexError: list index out of range.
I am trying to build a function that determines which products should be placed on my store's shelves, subject to these constraints:
The shelf has a max capacity of 200
Small-sized items should be placed first
If two or more items have the same size, the item with the highest price should be placed first
As input for the function, I have a list of tuples, dairy_items, in the form [(id, size, price)].
This is my code:
capacity=200
dairy_items=[('p1', 10, 3), ('p2', 13, 5),
('p3', 15, 2), ('p4', 26, 2),
('p5', 18, 6), ('p6', 25, 3),
('p7', 20, 4), ('p8', 10, 5),
('p9', 15, 4), ('p10', 12, 7),
('p11', 19, 3), ('p12', 27, 6),
('p13', 16, 4), ('p14', 23, 5),
('p15', 14, 2), ('p16', 23, 5),
('p17', 12, 7), ('p18', 11, 3),
('p19', 16, 5), ('p20', 11, 4)]
def shelving(dairy_items):
    #first: sort the list of tuples based on size: low-to-big
    items = sorted(dairy_items, key=lambda x: x[1], reverse=False)
    #second: iterate the sorted list of tuples.
    #algorithm: retrieve the first 2 elements of the sorted list
    #then compare those two elements by applying rules/conditions as stated
    #the 'winning' element is placed in 'result' and this element is removed from 'items'. Also the 'temp' list is reset
    #do again until nothing more can be added to the shelves (capacity full and does not exceed the limit)
    result = []
    total_price = []
    temp_capacity = []
    temp = items[:2]
    while sum(temp_capacity) < capacity:
        #add conditions: (low first) and (if size the same, highest price first)
        if (temp[0][1] == temp[1][1]) and (temp[0][2] > temp[1][2]):
            temp_capacity.append(temp[0][1])
            result.append(temp.pop(0))
            items.pop(0)
            temp.clear()
            temp = items[:2]
            total_price.append(temp[0][2])
        elif ((temp[0][1] == temp[1][1])) and (temp[0][2] < temp[1][2]):
            temp_capacity.append(temp[1][1])
            result.append(temp.pop())
            items.pop()
            temp.clear()
            temp = items[:2]
            total_price.append(temp[1][2])
        else:
            temp_capacity.append(temp[0][1])
            result.append(temp.pop(0))
            items.pop(0)
            temp.clear()
            temp = items[:2]
            total_price.append(temp[0][2])
    result = result.append(temp_capacity)
    #return a tuple with three elements: ([list of product ID to be placed in order], total occupied capacity of shelves, total prices)
    return result
c:\Users\abc\downloads\listexercise.py in <module>
----> 1 print(shelving(dairy_items))
c:\Users\abc\downloads\listexercise.py in shelving(dairy_items)
28 while sum(temp_capacity) < capacity:
29
---> 30 if (temp[0][1] == temp[1][1]) and (temp[0][2] > temp[1][2]):
31 temp_capacity.append(temp[0][1])
32 result.append(temp2.pop(0))
IndexError: list index out of range
EDIT:
This is the expected result:
#Result should be True
print(shelving(dairy_items) == (['p8', 'p1', 'p20', 'p18', 'p10', 'p17', 'p2', 'p15', 'p9', 'p3', 'p19', 'p13', 'p5', 'p11'], 192, 60))
The IndexError occurred because the code indexes the second element of temp (temp[1]) when temp holds only one element, which can only be indexed with 0.
I also noticed a few more bugs that would keep your program from giving the correct output, and fixed them.
The following code produces the expected result:
from time import time
start = time()
capacity = 200
dairy_items = [('p1', 10, 3), ('p2', 13, 5),
('p3', 15, 2), ('p4', 26, 2),
('p5', 18, 6), ('p6', 25, 3),
('p7', 20, 4), ('p8', 10, 5),
('p9', 15, 4), ('p10', 12, 7),
('p11', 19, 3), ('p12', 27, 6),
('p13', 16, 4), ('p14', 23, 5),
('p15', 14, 2), ('p16', 23, 5),
('p17', 12, 7), ('p18', 11, 3),
('p19', 16, 5), ('p20', 11, 4)]
def shelving(dairy_items):
    items = sorted(dairy_items, key=lambda x: x[1])
    result = ([],)
    total_price, temp_capacity = 0, 0
    while (temp_capacity+items[0][1]) < capacity:
        temp = items[:2]
        if temp[0][1] == temp[1][1]:
            if temp[0][2] > temp[1][2]:
                temp_capacity += temp[0][1]
                result[0].append(temp[0][0])
                total_price += temp[0][2]
                items.pop(0)
            elif temp[0][2] < temp[1][2]:
                temp_capacity += temp[1][1]
                result[0].append(temp[1][0])
                total_price += temp[1][2]
                items.pop(items.index(temp[1]))
            else:
                temp_capacity += temp[0][1]
                result[0].append(temp[0][0])
                total_price += temp[0][2]
                items.pop(0)
        else:
            temp_capacity += temp[0][1]
            result[0].append(temp[0][0])
            total_price += temp[0][2]
            items.pop(0)
    result += (temp_capacity, total_price)
    return result
a = shelving(dairy_items)
end = time()
print(a)
print(f"\nTime Taken : {end-start} secs")
Output:-
(['p8', 'p1', 'p20', 'p18', 'p10', 'p17', 'p2', 'p15', 'p9', 'p3', 'p19', 'p13', 'p5', 'p11'], 192, 60)
Time Taken : 3.123283386230469e-05 secs
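Both rules (smallest size first, highest price first on ties) can also be expressed as a single sort key, which avoids comparing pairs by hand. The following is a sketch, not the code from the answer above. It assumes that an item may fill the shelf exactly to capacity, whereas the loop above stops strictly below it. The capacity parameter and variable names are chosen for illustration.

```python
def shelving(items, capacity=200):
    # Sort by size ascending, then by price descending (negated price).
    ordered = sorted(items, key=lambda item: (item[1], -item[2]))
    placed, used, total_price = [], 0, 0
    for product_id, size, price in ordered:
        # Assumption: stop at the first item that would exceed the capacity.
        if used + size > capacity:
            break
        placed.append(product_id)
        used += size
        total_price += price
    return placed, used, total_price


dairy_items = [('p1', 10, 3), ('p2', 13, 5), ('p3', 15, 2), ('p4', 26, 2),
               ('p5', 18, 6), ('p6', 25, 3), ('p7', 20, 4), ('p8', 10, 5),
               ('p9', 15, 4), ('p10', 12, 7), ('p11', 19, 3), ('p12', 27, 6),
               ('p13', 16, 4), ('p14', 23, 5), ('p15', 14, 2), ('p16', 23, 5),
               ('p17', 12, 7), ('p18', 11, 3), ('p19', 16, 5), ('p20', 11, 4)]

print(shelving(dairy_items))
```

Because Python's sort is stable, items with the same size and price (such as p10 and p17) keep their original relative order.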
Not sure what the question is, but the following information may be relevant:
IndexError occurs when a sequence subscript is out of range. What does this mean? Consider the following code:
l = [1, 2, 3]
a = l[0]
This code does two things:
Define a list of 3 integers called l
Assigns the first element of l to a variable called a
Now, if I were to do the following:
l = [1, 2, 3]
a = l[3]
I would raise an IndexError, as I'm accessing the fourth element of a three-element list. Somewhere in your code, you're likely over-indexing your list. This is a good chance to learn about debugging using pdb. Throw a call to breakpoint() in your code and inspect the variables. Good luck!
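As a small illustration of the point above: the valid non-negative indices of a list run from 0 to len(l) - 1, so an access can be guarded either by checking the length first or by catching the exception. The helper name safe_get is hypothetical and only serves the example.

```python
def safe_get(seq, index, default=None):
    """Hypothetical helper: return seq[index], or `default` if index is out of range."""
    try:
        return seq[index]
    except IndexError:
        return default


l = [1, 2, 3]
last_valid_index = len(l) - 1          # 2 for a three-element list
first = safe_get(l, 0)                 # element at index 0
missing = safe_get(l, 3)               # index 3 does not exist, so the default is returned
print(last_valid_index, first, missing)
```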
OK, firstly, you should debug your code. If you print temp before adding temp[1][2] to total_price, you will see that the last value of temp is what causes the error. Here is an example:
capacity=200
dairy_items=[('p1', 10, 3), ('p2', 13, 5),
('p3', 15, 2), ('p4', 26, 2),
('p5', 18, 6), ('p6', 25, 3),
('p7', 20, 4), ('p8', 10, 5),
('p9', 15, 4), ('p10', 12, 7),
('p11', 19, 3), ('p12', 27, 6),
('p13', 16, 4), ('p14', 23, 5),
('p15', 14, 2), ('p16', 23, 5),
('p17', 12, 7), ('p18', 11, 3),
('p19', 16, 5), ('p20', 11, 4)]
def shelving(dairy_items):
    #first: sort the list of tuples based on size: low-to-big
    items = sorted(dairy_items, key=lambda x: x[1], reverse=False)
    #second: iterate the sorted list of tuples.
    #algorithm: retrieve the first 2 elements of the sorted list
    #then compare those two elements by applying rules/conditions as stated
    #the 'winning' element is placed in 'result' and this element is removed from 'items'. Also the 'temp' list is reset
    #do again until nothing more can be added to the shelves (capacity full and does not exceed the limit)
    result = []
    total_price = []
    temp_capacity = []
    temp = items[:2]
    while sum(temp_capacity) < capacity:
        #add conditions: (low first) and (if size the same, highest price first)
        if (temp[0][1] == temp[1][1]) and (temp[0][2] > temp[1][2]):
            temp_capacity.append(temp[0][1])
            result.append(temp.pop(0))
            items.pop(0)
            temp.clear()
            temp = items[:2]
            total_price.append(temp[0][2])
        elif ((temp[0][1] == temp[1][1])) and (temp[0][2] < temp[1][2]):
            temp_capacity.append(temp[1][1])
            result.append(temp.pop())
            items.pop()
            temp.clear()
            temp = items[:2]
            print(temp) # -----------NEW LINE ADDED TO DEBUG YOUR CODE
            total_price.append(temp[1][2])
        else:
            temp_capacity.append(temp[0][1])
            result.append(temp.pop(0))
            items.pop(0)
            temp.clear()
            temp = items[:2]
            total_price.append(temp[0][2])
    result = result.append(temp_capacity)
    #return a tuple with three elements: ([list of product ID to be placed in order], total occupied capacity of shelves, total prices)
    return result
shelving(dairy_items)
The result I am getting is:
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3)]
Traceback (most recent call last):
File "<string>", line 55, in <module>
File "<string>", line 44, in shelving
IndexError: list index out of range
As you can see, the last printed value, [('p1', 10, 3)], holds only one tuple, so temp[1] raises the IndexError.
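A possible explanation for why the same pair keeps being printed: in the elif branch, items.pop() is called without an index, which removes the last element of items rather than the second of the two compared items, so items[:2] does not change. A minimal sketch with made-up data:

```python
items = ["a", "b", "c", "d"]     # made-up data, stands in for the sorted products
pair_before = items[:2]

removed_last = items.pop()       # no index: removes the last element
pair_after = items[:2]           # the first two elements are unchanged

removed_second = items.pop(1)    # explicit index: removes the second element
print(pair_before, pair_after, removed_last, removed_second, items)
```

If the intent is to drop the second compared item, passing the index explicitly, as in items.pop(1), would do that.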

Construct date based on week_of_month and day_of_week criteria

I am not sure how to go about constructing datetime object given year, month, week_of_month and day_of_week. Any clues? Using this I am trying to achieve following:
From (start_month, start_year) to (end_month, end_year) find monthly dates as specified by week_of_month and day_of_week parameters. Here 1 <= week_of_month <= 5 and 1 <= day_of_week <= 7. Now,
Each month may not have 5 weeks (eg. February in non-leap year)
1st and 5th week may not have 7 days.
In such cases, based on boolean is_to_next_day, if True then specify next calendar day, if False then skip it.
Sample input/outputs:
Input parameters: start_month=1 start_year=2020, end_month=12, end_year=2020, week_of_month=5, day_of_week=3, is_to_next_day=True
Desired output: [datetime(2020, 1, 29), datetime(2020, 2, 26), datetime(2020, 3, 25), datetime(2020, 4, 29), datetime(2020, 5, 27), datetime(2020, 7, 1), datetime(2020, 7, 29), datetime(2020, 8, 26), datetime(2020, 9, 30), datetime(2020, 10, 28), datetime(2020, 11, 25), datetime(2020, 12, 30)]
Input parameters: start_month=1 start_year=2020, end_month=12, end_year=2020, week_of_month=5, day_of_week=3, is_to_next_day=False
Desired output: [datetime(2020, 1, 29), datetime(2020, 2, 26), datetime(2020, 3, 25), datetime(2020, 4, 29), datetime(2020, 5, 27), datetime(2020, 7, 29), datetime(2020, 8, 26), datetime(2020, 9, 30), datetime(2020, 10, 28), datetime(2020, 11, 25), datetime(2020, 12, 30)]
import calendar
from datetime import datetime

def get_date(year, month, week_of_month, day_of_week, is_to_next_day):
    # monthcalendar returns one list per week (Monday first); days outside the month are 0.
    mnth = calendar.monthcalendar(year, month)
    if (week_of_month > 1) and (week_of_month < 5):
        day = mnth[week_of_month - 1][day_of_week - 1]
        return datetime(year, month, day)
    elif week_of_month == 1:
        # Look the day up in the first week; 0 means that weekday falls before the 1st.
        day = mnth[0][day_of_week - 1]
        if day != 0:
            return datetime(year, month, day)
        elif is_to_next_day:
            return datetime(year, month, mnth[1][0])
        else:
            return None
    else:
        if len(mnth) >= week_of_month:
            day = mnth[week_of_month - 1][day_of_week - 1]
            if day == 0 and is_to_next_day:
                # First day of the next month; December rolls over to January of the next year.
                return datetime(year + month // 12, month % 12 + 1, 1)
            elif day == 0:
                return None
            else:
                return datetime(year, month, day)
        if len(mnth) < week_of_month:
            if is_to_next_day:
                return datetime(year + month // 12, month % 12 + 1, 1)
            else:
                return None

# First output
[get_date(yy, mm, 5, 3, True) for mm in range(1, 13) for yy in [2020]]
# Second output
[get_date(yy, mm, 5, 3, False) for mm in range(1, 13) for yy in [2020]] # Iterate again to drop None.
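To see why a given week and weekday may not exist in a month, it can help to look at what calendar.monthcalendar returns. The sketch below only inspects two months and assumes the default setting where weeks start on Monday. The variable names are chosen for illustration.

```python
import calendar

# June 2020 starts on a Monday and has 30 days, so its fifth week is mostly padding.
june_weeks = calendar.monthcalendar(2020, 6)
fifth_week = june_weeks[4]
wednesday_of_fifth_week = fifth_week[3 - 1]    # day_of_week=3; 0 means "not in this month"

# February 2021 starts on a Monday and has 28 days, so it has only four weeks.
february_weeks = calendar.monthcalendar(2021, 2)

print(fifth_week)
print(wednesday_of_fifth_week)
print(len(february_weeks))
```

A zero entry or a missing week is exactly the case where is_to_next_day decides between moving to the next calendar day and skipping the month.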

Summing over an index in Python

I have a large index list and a large parameter dictionary. The length of some_dict is almost 1 million. I am trying to create a new dictionary from this 4-index dictionary by aggregating it over the last index, so that it is keyed by 3 indices. I also have the list RPS containing the 3-tuple indices, which has a length of about 0.5 million. The current loop has still not finished after a long while. Is there any Pythonic trick to speed things up?
mynewdict= {(r,p,s): sum(some_dict[r,p,s,t]/another_dict[t]
for (r,p,s,t) in RPST) for (r,p,s) in RPS}
For a minimal example, let:
RPS = [(1, 13, 37),
(1, 13, 38),
(1, 13, 39),
(1, 13, 40)]
RPST = [(1, 13, 37, 9027),
(1, 13, 37, 9028),
(1, 13, 37, 9058),
(1, 13, 38, 9027),
(1, 13, 38, 9028),
(1, 13, 38, 9058),
(1, 13, 39, 9027),
(1, 13, 39, 9028),
(1, 13, 40, 9027),
(1, 13, 40, 9028)]
some_dict = { (1, 13, 37, 9027): 1,
(1, 13, 37, 9028): 1,
(1, 13, 37, 9058): 1,
(1, 13, 38, 9027): 1,
(1, 13, 38, 9028): 1,
(1, 13, 38, 9058): 1,
(1, 13, 39, 9027): 1,
(1, 13, 39, 9028): 1,
(1, 13, 40, 9027): 1,
(1, 13, 40, 9028): 1}
another_dict = {9027: 2, 9028: 2, 9058: 2}
Here is my solution for future reference:
RPSTdict = {(r,p,s):[] for (r,p,s) in RPS}
for (r,p,s,t) in RPST:
    RPSTdict[r,p,s].append(t)
mynewdict = {(r,p,s): sum(some_dict[r,p,s,t]/another_dict[t]
                          for t in RPSTdict[r,p,s]) for (r,p,s) in RPS}
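A related option, sketched here on the minimal example from the question, is to accumulate the sums in a single pass over RPST with a defaultdict, so that no intermediate list of t values is needed. The name totals is an assumption made for the example.

```python
from collections import defaultdict

RPS = [(1, 13, 37), (1, 13, 38), (1, 13, 39), (1, 13, 40)]
RPST = [(1, 13, 37, 9027), (1, 13, 37, 9028), (1, 13, 37, 9058),
        (1, 13, 38, 9027), (1, 13, 38, 9028), (1, 13, 38, 9058),
        (1, 13, 39, 9027), (1, 13, 39, 9028),
        (1, 13, 40, 9027), (1, 13, 40, 9028)]
some_dict = {key: 1 for key in RPST}
another_dict = {9027: 2, 9028: 2, 9058: 2}

# One pass over RPST: add each term to the running total of its (r, p, s) key.
totals = defaultdict(float)
for (r, p, s, t) in RPST:
    totals[r, p, s] += some_dict[r, p, s, t] / another_dict[t]

# Keys of RPS that never occur in RPST get 0.0, matching the sum of an empty sequence.
mynewdict = {key: totals[key] for key in RPS}
print(mynewdict)
```

Either way, the work grows with len(RPST) + len(RPS) instead of their product, which is what makes the original nested comprehension slow.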

Collapse tuple column by first tuple element

:)
I have this dataset:
r = pd.DataFrame({'duplicates': [ [("007", "us1", "us2", 7, 1), ("001", "us1", "us2", 9, 8), ("009", "us1", "us2", 28, 27)], ("007", "us2", "us1", 8, 15), ("009", "us4", "us1", 29, 30), ("009", "us4", "us1", 29, 30)],
'id': ["b", 'c', 'b', "c"]})
duplicates id
0 [(007, us1, us2, 7, 1), (001, us1, us2, 9, 8), (009, us1, us2, 28, 27)] b
1 (007, us2, us1, 8, 15) c
2 (009, us4, us1, 29, 30) b
3 (009, us4, us1, 29, 30) c
Here the tuples are grouped based on the us1, us2 [...] order, so tuples are in the same row if they have the same 'id' and the same sequence of users. In the first tuple of the first row, for example, us1 accessed record 007 at time 7, and us2 accessed record 007 at time 1.
What I want is to have this:
j = pd.DataFrame({'duplicates': [ ("007", ['us2', 'us1', 'us2', 'us1'], 1, 7, 8, 15), ("001", ['us2', 'us1'], 8, 9), ("009", ['us2', 'us1', 'us4', 'us1'], 27, 28, 29, 30), ("009", ['us4', 'us1'], 29, 30)],
'id': ["b", "b", 'b', "c"]})
duplicates id
0 (007, [us2, us1, us2, us1], 1, 7, 8, 15) b
1 (001, [us2, us1], 8, 9) b
2 (009, [us2, us1, us4, us1], 27, 28, 29, 30) b
3 (009, [us4, us1], 29, 30) c
In this case I want to group by id and by the first element of the tuple. As an example, for the first row I've used '007' and the 'id' as the keys, and then I add the users ordered by access time. So us2 comes before us1 because us2 accessed at time 1 and us1 accessed at time 7, and 1 < 7.
This is what I have so far, but it's very far from the desired result, and I don't know what to do:
r.explode('duplicates').groupby(['id', r['duplicates'].str[0]])['duplicates'].apply(list).reset_index(level=1, drop=True).reset_index()
Thank you a lot!
