PySpark RDD processing for sum of parts - apache-spark

I have an RDD with tuples like (datetime, integer).
I am trying to derive another RDD from it that holds sums over fixed intervals, using PySpark.
For example, from the following:
(2015-09-30 10:00:01, 3)
(2015-09-30 10:00:02, 1)
(2015-09-30 10:00:05, 2)
(2015-09-30 10:00:06, 7)
(2015-09-30 10:00:07, 3)
(2015-09-30 10:00:10, 5)
I'm trying to get the following sums over every 3 seconds:
(2015-09-30 10:00:01, 4) # sum of 1, 2, 3 seconds
(2015-09-30 10:00:02, 1) # sum of 2, 3, 4 seconds
(2015-09-30 10:00:05, 12) # sum of 5, 6, 7 seconds
(2015-09-30 10:00:06, 10) # sum of 6, 7, 8 seconds
(2015-09-30 10:00:07, 3) # sum of 7, 8, 9 seconds
(2015-09-30 10:00:10, 5) # sum of 10, 11, 12 seconds
Please, could you give me any hints?

I will assume that your input is an RDD time_rdd of tuples whose first element is a datetime object and whose second element is an integer. You can use flatMap to emit each count under its own timestamp and under the two preceding seconds, and then use reduceByKey to add up the counts per timestamp. Each key then holds the sum over the 3-second window that starts at that second.
from datetime import timedelta

def map_to_3_seconds(datetime_obj, count):
    # Emit the count under this timestamp and under the two seconds before it.
    return [(datetime_obj + timedelta(seconds=i), count) for i in range(-2, 1)]

# Python 3 lambdas cannot unpack tuple arguments, so index into the pair instead.
output_rdd = (time_rdd
              .flatMap(lambda kv: map_to_3_seconds(kv[0], kv[1]))
              .reduceByKey(lambda x, y: x + y))
This RDD contains more datetime keys than the original RDD, so if you only want the original times, join it with time_rdd:
result = output_rdd.join(time_rdd).map(lambda kv: (kv[0], kv[1][0])).collect()
Now result will contain:
[(datetime.datetime(2015, 9, 30, 10, 0, 5), 12),
(datetime.datetime(2015, 9, 30, 10, 0, 2), 1),
(datetime.datetime(2015, 9, 30, 10, 0, 10), 5),
(datetime.datetime(2015, 9, 30, 10, 0, 1), 4),
(datetime.datetime(2015, 9, 30, 10, 0, 6), 10),
(datetime.datetime(2015, 9, 30, 10, 0, 7), 3)]
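The flatMap/reduceByKey idea can be checked without a Spark cluster. The sketch below is a plain-Python stand-in, not PySpark code; window_sums and its width parameter are hypothetical names introduced here, and it assumes each timestamp occurs at most once in the input.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def window_sums(records, width=3):
    """Plain-Python stand-in for the flatMap + reduceByKey + join steps."""
    totals = defaultdict(int)
    for ts, count in records:
        # "flatMap": credit the count to ts and to the (width - 1) seconds before it
        for i in range(width):
            totals[ts - timedelta(seconds=i)] += count
    # "join": keep only the timestamps that occur in the input
    return [(ts, totals[ts]) for ts, _ in records]

base = datetime(2015, 9, 30, 10, 0, 0)
records = [(base + timedelta(seconds=s), v)
           for s, v in [(1, 3), (2, 1), (5, 2), (6, 7), (7, 3), (10, 5)]]
sums = window_sums(records)
print([value for _, value in sums])
```

The printed values should match the sums requested in the question, in input order.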

Related

Python 3: IndexError: list index out of range while doing Knapsack Problem

I am currently teaching myself Python for a career change. While doing some exercises on lists, I ran into IndexError: list index out of range.
I am trying to build a function that determines which products should be placed on my store's shelves, subject to these constraints:
The shelf has a maximum capacity of 200.
Small items should be placed first.
If two or more items have the same size, the item with the highest price should be placed first.
As input, the function takes a list of tuples, dairy_items, of the form [(id, size, price)].
This is my code:
capacity=200
dairy_items=[('p1', 10, 3), ('p2', 13, 5),
('p3', 15, 2), ('p4', 26, 2),
('p5', 18, 6), ('p6', 25, 3),
('p7', 20, 4), ('p8', 10, 5),
('p9', 15, 4), ('p10', 12, 7),
('p11', 19, 3), ('p12', 27, 6),
('p13', 16, 4), ('p14', 23, 5),
('p15', 14, 2), ('p16', 23, 5),
('p17', 12, 7), ('p18', 11, 3),
('p19', 16, 5), ('p20', 11, 4)]
def shelving(dairy_items):
    # first: sort the list of tuples by size, smallest first
    items = sorted(dairy_items, key=lambda x: x[1], reverse=False)
    # second: iterate over the sorted list of tuples.
    # algorithm: retrieve the first 2 elements of the sorted list,
    # then compare those two elements by applying the rules/conditions as stated.
    # the 'winning' element is placed in 'result' and removed from 'items'; the 'temp' list is also reset.
    # repeat until nothing more can be added to the shelf (capacity is full and the limit is not exceeded)
    result = []
    total_price = []
    temp_capacity = []
    temp = items[:2]
    while sum(temp_capacity) < capacity:
        # add conditions: (low first) and (if size the same, highest price first)
        if (temp[0][1] == temp[1][1]) and (temp[0][2] > temp[1][2]):
            temp_capacity.append(temp[0][1])
            result.append(temp.pop(0))
            items.pop(0)
            temp.clear()
            temp = items[:2]
            total_price.append(temp[0][2])
        elif ((temp[0][1] == temp[1][1])) and (temp[0][2] < temp[1][2]):
            temp_capacity.append(temp[1][1])
            result.append(temp.pop())
            items.pop()
            temp.clear()
            temp = items[:2]
            total_price.append(temp[1][2])
        else:
            temp_capacity.append(temp[0][1])
            result.append(temp.pop(0))
            items.pop(0)
            temp.clear()
            temp = items[:2]
            total_price.append(temp[0][2])
    result = result.append(temp_capacity)
    # return a tuple with three elements: ([list of product IDs to be placed, in order], total occupied capacity of the shelf, total price)
    return result
c:\Users\abc\downloads\listexercise.py in <module>
----> 1 print(shelving(dairy_items))
c:\Users\abc\downloads\listexercise.py in shelving(dairy_items)
28 while sum(temp_capacity) < capacity:
29
---> 30 if (temp[0][1] == temp[1][1]) and (temp[0][2] > temp[1][2]):
31 temp_capacity.append(temp[0][1])
32 result.append(temp2.pop(0))
IndexError: list index out of range
EDIT:
This is the expected result:
#Result should be True
print(shelving(dairy_items) == (['p8', 'p1', 'p20', 'p18', 'p10', 'p17', 'p2', 'p15', 'p9', 'p3', 'p19', 'p13', 'p5', 'p11'], 192, 60))
The IndexError occurs because temp ends up holding only one element, so temp[1] is out of range. In the elif branch, items.pop() removes the last element of items rather than the item you just selected, so items keeps shrinking while temp stays the same, until items[:2] returns a single tuple.
I also noticed a few more bugs that would keep your program from producing the correct output, and fixed them.
The following code gives the expected result:
from time import time
start = time()
capacity = 200
dairy_items = [('p1', 10, 3), ('p2', 13, 5),
('p3', 15, 2), ('p4', 26, 2),
('p5', 18, 6), ('p6', 25, 3),
('p7', 20, 4), ('p8', 10, 5),
('p9', 15, 4), ('p10', 12, 7),
('p11', 19, 3), ('p12', 27, 6),
('p13', 16, 4), ('p14', 23, 5),
('p15', 14, 2), ('p16', 23, 5),
('p17', 12, 7), ('p18', 11, 3),
('p19', 16, 5), ('p20', 11, 4)]
def shelving(dairy_items):
    items = sorted(dairy_items, key=lambda x: x[1])
    result = ([],)
    total_price, temp_capacity = 0, 0
    while (temp_capacity+items[0][1]) < capacity:
        temp = items[:2]
        if temp[0][1] == temp[1][1]:
            if temp[0][2] > temp[1][2]:
                temp_capacity += temp[0][1]
                result[0].append(temp[0][0])
                total_price += temp[0][2]
                items.pop(0)
            elif temp[0][2] < temp[1][2]:
                temp_capacity += temp[1][1]
                result[0].append(temp[1][0])
                total_price += temp[1][2]
                items.pop(items.index(temp[1]))
            else:
                temp_capacity += temp[0][1]
                result[0].append(temp[0][0])
                total_price += temp[0][2]
                items.pop(0)
        else:
            temp_capacity += temp[0][1]
            result[0].append(temp[0][0])
            total_price += temp[0][2]
            items.pop(0)
    result += (temp_capacity, total_price)
    return result
a = shelving(dairy_items)
end = time()
print(a)
print(f"\nTime Taken : {end-start} secs")
Output:-
(['p8', 'p1', 'p20', 'p18', 'p10', 'p17', 'p2', 'p15', 'p9', 'p3', 'p19', 'p13', 'p5', 'p11'], 192, 60)
Time Taken : 3.123283386230469e-05 secs
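If the tie-breaking rule is folded into the sort key, the pairwise comparison is no longer needed. The following is a sketch of that alternative, not the code from the answer above; shelving_sorted is a hypothetical name, and it assumes an item may fill the shelf exactly to capacity (the loop above stops one step earlier in that edge case).

```python
dairy_items = [('p1', 10, 3), ('p2', 13, 5), ('p3', 15, 2), ('p4', 26, 2),
               ('p5', 18, 6), ('p6', 25, 3), ('p7', 20, 4), ('p8', 10, 5),
               ('p9', 15, 4), ('p10', 12, 7), ('p11', 19, 3), ('p12', 27, 6),
               ('p13', 16, 4), ('p14', 23, 5), ('p15', 14, 2), ('p16', 23, 5),
               ('p17', 12, 7), ('p18', 11, 3), ('p19', 16, 5), ('p20', 11, 4)]

def shelving_sorted(items, capacity=200):
    """Greedy fill: smallest size first, higher price first when sizes tie."""
    # Negating the price sorts equal-sized items from most to least expensive.
    ordered = sorted(items, key=lambda item: (item[1], -item[2]))
    ids, used, total_price = [], 0, 0
    for item_id, size, price in ordered:
        if used + size > capacity:
            break  # the next-smallest item no longer fits, so nothing later does
        ids.append(item_id)
        used += size
        total_price += price
    return ids, used, total_price

shelf = shelving_sorted(dairy_items)
print(shelf)
```

Because sorted() is stable, items with equal size and equal price keep their input order, which is why 'p10' is placed before 'p17'.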
Not sure what the question is, but the following information may be relevant:
IndexError occurs when a sequence subscript is out of range. What does this mean? Consider the following code:
l = [1, 2, 3]
a = l[0]
This code does two things:
It defines a list of 3 integers called l.
It assigns the first element of l to a variable called a.
Now, if I were to do the following:
l = [1, 2, 3]
a = l[3]
I would raise an IndexError, as I'm accessing the fourth element of a three-element list. Somewhere in your code, you're likely over-indexing your list. This is a good chance to learn about debugging with pdb. Put a call to breakpoint() in your code and inspect the variables. Good luck!
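One detail that makes this kind of bug easy to miss: slicing never raises, it simply returns whatever elements exist, while indexing into the shortened result does raise. A minimal sketch, with variable names chosen here for illustration:

```python
items = [('p1', 10, 3)]

pair = items[:2]      # slicing past the end is allowed and yields 1 element
print(len(pair))

try:
    pair[1]           # indexing the missing second element raises
except IndexError as exc:
    message = str(exc)
    print(message)
```

Checking len(pair) before reading pair[1] is one way to guard against this.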
First, you should debug your code. If you print temp just before appending temp[1][2] to total_price, you will see that the last access is what causes the error. Here is an example:
capacity=200
dairy_items=[('p1', 10, 3), ('p2', 13, 5),
('p3', 15, 2), ('p4', 26, 2),
('p5', 18, 6), ('p6', 25, 3),
('p7', 20, 4), ('p8', 10, 5),
('p9', 15, 4), ('p10', 12, 7),
('p11', 19, 3), ('p12', 27, 6),
('p13', 16, 4), ('p14', 23, 5),
('p15', 14, 2), ('p16', 23, 5),
('p17', 12, 7), ('p18', 11, 3),
('p19', 16, 5), ('p20', 11, 4)]
def shelving(dairy_items):
    # first: sort the list of tuples by size, smallest first
    items = sorted(dairy_items, key=lambda x: x[1], reverse=False)
    # second: iterate over the sorted list of tuples.
    # algorithm: retrieve the first 2 elements of the sorted list,
    # then compare those two elements by applying the rules/conditions as stated.
    # the 'winning' element is placed in 'result' and removed from 'items'; the 'temp' list is also reset.
    # repeat until nothing more can be added to the shelf (capacity is full and the limit is not exceeded)
    result = []
    total_price = []
    temp_capacity = []
    temp = items[:2]
    while sum(temp_capacity) < capacity:
        # add conditions: (low first) and (if size the same, highest price first)
        if (temp[0][1] == temp[1][1]) and (temp[0][2] > temp[1][2]):
            temp_capacity.append(temp[0][1])
            result.append(temp.pop(0))
            items.pop(0)
            temp.clear()
            temp = items[:2]
            total_price.append(temp[0][2])
        elif ((temp[0][1] == temp[1][1])) and (temp[0][2] < temp[1][2]):
            temp_capacity.append(temp[1][1])
            result.append(temp.pop())
            items.pop()
            temp.clear()
            temp = items[:2]
            print(temp) # -----------NEW LINE ADDED TO DEBUG YOUR CODE
            total_price.append(temp[1][2])
        else:
            temp_capacity.append(temp[0][1])
            result.append(temp.pop(0))
            items.pop(0)
            temp.clear()
            temp = items[:2]
            total_price.append(temp[0][2])
    result = result.append(temp_capacity)
    # return a tuple with three elements: ([list of product IDs to be placed, in order], total occupied capacity of the shelf, total price)
    return result
shelving(dairy_items)
The output I get is:
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3), ('p8', 10, 5)]
[('p1', 10, 3)]
Traceback (most recent call last):
File "<string>", line 55, in <module>
File "<string>", line 44, in shelving
IndexError: list index out of range
As you can see, the last value printed, [('p1', 10, 3)], contains only one tuple, so temp[1] does not exist, hence the IndexError.

Change a future date every five days

I am working on a point of sale app in Django in which the customer books a product and it is delivered in 45 days. I can get the delivery date while booking using the following:
from datetime import datetime, timedelta
DELIVERY_IN_DAYS = 45
delivery_on = datetime.today() + timedelta(days=DELIVERY_IN_DAYS)
delivery_on = delivery_on.strftime('%Y-%m-%d')
Now I want delivery_on to remain the same for 5 days and change on the 6th day. Can I do this without using a background Celery job?
Thanks in advance.
Yes, we can determine a date modulo the number of days with:
from datetime import date, timedelta
today = date.today()
offset_date = date(2000, 1, 1)
dt = today - offset_date
delivery_on = today + timedelta(days=DELIVERY_IN_DAYS - dt.days % 5)
We can wrap the logic into a function:
def to_deliver_day(day):
    offset_date = date(2000, 1, 1)
    dt = day - offset_date
    return day + timedelta(days=DELIVERY_IN_DAYS - dt.days % 5)
If we call this function for each day from July 1st to July 10th, we get:
>>> to_deliver_day(date(2021, 7, 1))
datetime.date(2021, 8, 13)
>>> to_deliver_day(date(2021, 7, 2))
datetime.date(2021, 8, 13)
>>> to_deliver_day(date(2021, 7, 3))
datetime.date(2021, 8, 13)
>>> to_deliver_day(date(2021, 7, 4))
datetime.date(2021, 8, 18)
>>> to_deliver_day(date(2021, 7, 5))
datetime.date(2021, 8, 18)
>>> to_deliver_day(date(2021, 7, 6))
datetime.date(2021, 8, 18)
>>> to_deliver_day(date(2021, 7, 7))
datetime.date(2021, 8, 18)
>>> to_deliver_day(date(2021, 7, 8))
datetime.date(2021, 8, 18)
>>> to_deliver_day(date(2021, 7, 9))
datetime.date(2021, 8, 23)
>>> to_deliver_day(date(2021, 7, 10))
datetime.date(2021, 8, 23)
By setting offset_date differently, you can change the day on which the result makes a "jump".
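One consequence worth checking: with this formula the lead time is no longer exactly 45 days but varies between 41 and 45 days, depending on where the booking falls in the 5-day block. The sketch below repeats the function in self-contained form and prints both the delivery dates and the lead times; BLOCK_DAYS and OFFSET_DATE are names introduced here for illustration.

```python
from datetime import date, timedelta

DELIVERY_IN_DAYS = 45
BLOCK_DAYS = 5                   # delivery date stays fixed for this many days
OFFSET_DATE = date(2000, 1, 1)   # anchor that decides where each block starts

def to_deliver_day(day):
    elapsed = (day - OFFSET_DATE).days
    # Subtracting the position inside the block pins every day of the block
    # to the same delivery date.
    return day + timedelta(days=DELIVERY_IN_DAYS - elapsed % BLOCK_DAYS)

bookings = [date(2021, 7, d) for d in range(4, 10)]   # July 4th to July 9th
deliveries = [to_deliver_day(day) for day in bookings]
lead_times = [(delivery - booking).days
              for booking, delivery in zip(bookings, deliveries)]
print(deliveries)
print(lead_times)
```

If a lead time of at least 45 days is required, the subtraction could be replaced by an addition that rounds up to the end of the block; that variant is not part of the answer above.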

How to concatenate list to int in Python?

When using a list, I saw that I cannot add or subtract the sample I took from the list. For example:
import random
x = random.sample ((1 ,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13), k=1 )
print(x + 1)
Why can't I add to the sample I took from the list, and how can I get around that issue?
If you want to increase the value of every item in a list, you can do it like this:
import random
x = random.sample ((1 ,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13), k=3 )
print(x)
for index in range(len(x)):
    x[index] = x[index] + 1
print(x)
In your case, if k is always 1, you can simply do:
import random
x = random.sample ((1 ,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13), k=1 )
print(x)
x[0] = x[0] + 1
print(x)
You can't add 1 to the result because random.sample returns a list (here of length k=1), not a number. If you want a single element of your sequence that you can add to, use random.choice instead. It should read something along the lines of:
import random
x = random.choice((1,2,3,4,5,6,7,8,9,10,11,12,13))
print(x+1)
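The difference between the two calls can be made explicit. The sketch below is an illustration with variable names chosen here; it shows the TypeError raised by list + int and two ways around it.

```python
import random

values = range(1, 14)

sample = random.sample(values, k=1)   # a list holding one number
try:
    sample + 1                        # list + int is not defined
    failed = False
except TypeError:
    failed = True

single = sample[0] + 1                # take the element out of the list first
bumped = [v + 1 for v in random.sample(values, k=3)]  # add 1 to every sampled value
print(failed, single, bumped)
```

The list comprehension generalises to any k, whereas random.choice only ever returns one element.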

get multiple tuples from list of tuples using min function

I have a list that looks like this
mylist = [('Part1', 5, 5), ('Part2', 7, 7), ('Part3', 11, 9),
('Part4', 45, 45), ('part5', 5, 5)]
I am looking for all the tuples that have a number closest to my input.
Right now I am using this code:
result = min([x for x in mylist if x[1] >= 4 and x[2] >= 4])
The result I am getting is
('part5', 5, 5)
But I am looking for a result more like
[('Part1', 5, 5), ('part5', 5, 5)]
If there are more matching tuples (there are 2 in this example, but there could be more), I would like to get all of them back.
The whole code:
mylist = [('Part1', 5, 5), ('Part2', 7, 7), ('Part3', 11, 9), ('Part4', 45, 45), ('part5', 5, 5)]
result = min([x for x in mylist if x[1] >= 4 and x[2] >= 4])
print(result)
threshold = 4
mylist = [('Part1', 5, 5), ('Part2', 7, 7), ('Part3', 11, 9), ('Part4', 45, 45), ('part5', 5, 5)]
filtered = [x for x in mylist if x[1] >= threshold and x[2] >= threshold]
keyfunc = lambda x: x[1]
my_min = keyfunc(min(filtered, key=keyfunc))
result = [v for v in filtered if keyfunc(v)==my_min]
# [('Part1', 5, 5), ('part5', 5, 5)]
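It may help to see why min() alone cannot do this. Without a key, tuples are compared element by element starting with the name string, so the numbers are only consulted when the names are equal; with a key, min() still returns just the first minimal tuple. The sketch below illustrates both points and then collects every tuple that shares the minimum; the variable names are chosen here for illustration.

```python
mylist = [('Part1', 5, 5), ('Part2', 7, 7), ('Part3', 11, 9),
          ('Part4', 45, 45), ('part5', 5, 5)]

plain_min = min(mylist)                        # ranks by the name string first
first_by_number = min(mylist, key=lambda t: t[1])  # only the first minimal tuple

smallest = first_by_number[1]
all_min = [t for t in mylist if t[1] == smallest]  # every tuple sharing the minimum
print(plain_min, first_by_number, all_min)
```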

How to define functions from an index set to an indefinite set of domains?

I want to define something like
list([[i0,i1,i2,i3, ..., ik]] for i0 in T[0] for i1 in T[1] for i2 in T[2] for i3 in T[3] for ...)
Since k is not known in advance, I cannot write this out explicitly, as in
list([[i0,i1,i2,i3]] for i0 in T[0] for i1 in T[1] for i2 in T[2] for i3 in T[3]).
Is there a general solution?
Many thanks!
Your nested for loops build the Cartesian product of the sublists in T. The itertools module has a product() function that returns an iterator over these tuples, which you can use like this:
from itertools import product
T = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10,11, 12]]
p = product(*T)
for i in p:
    print(i)
(1, 4, 7, 10)
(1, 4, 7, 11)
(1, 4, 7, 12)
(1, 4, 8, 10)
(1, 4, 8, 11)
(1, 4, 8, 12)
(1, 4, 9, 10)
(1, 4, 9, 11)
...
(3, 6, 9, 10)
(3, 6, 9, 11)
(3, 6, 9, 12)
Of course, you can also pass it to list() if you want the values in a list.
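Because product(*T) unpacks T at call time, the same line works for any number of sublists. A small sketch, with all_choices as a hypothetical helper name, that returns lists rather than tuples to stay close to the shape used in the question:

```python
from itertools import product

def all_choices(T):
    """Return every way to pick one element from each sublist of T."""
    return [list(combo) for combo in product(*T)]

T = [[1, 2], [3], [4, 5]]
combos = all_choices(T)
print(combos)
```

The number of results is the product of the sublist lengths, so it grows quickly; iterating over product(*T) directly avoids holding them all in memory.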
