With the property and setter decorators I can define getter and setter functions. This is fine for primitives, but how do I index into a collection or a numpy array? Setting a value by index seems to work, but the setter function doesn't get called; otherwise the print call in the minimal example below would be executed.
class Data:
    def __init__(self):
        self._arr = [0, 1, 2]

    @property
    def arr(self):
        return self._arr

    @arr.setter
    def arr(self, value):
        print("new value set")  # I want this to be executed
        self._arr = value

data = Data()
print(data.arr)  # prints [0, 1, 2]
data.arr[2] = 5
print(data.arr)  # prints [0, 1, 5]
The setter isn't called because data.arr[2] = 5 first invokes the getter, which returns the inner list, and then calls __setitem__ on that list; the property itself is never assigned to. If you want this just for one list of your class instance, you can get the effect by defining the __setitem__ and __getitem__ dunder methods on the class itself:
class Data:
    def __init__(self):
        self._arr = [0, 1, 2]

    @property
    def arr(self):
        return self._arr

    @arr.setter
    def arr(self, value):
        print("new inner list set")
        self._arr = value

    def __setitem__(self, key, value):
        print("new value set")
        self._arr[key] = value

    def __getitem__(self, key):
        return self._arr[key]

data = Data()
print(data.arr)
data[2] = 5
print(data.arr)
data.arr = [42, 43]
print(data.arr)
Output:
[0, 1, 2]
new value set # by data[2] = 5, via __setitem__
[0, 1, 5]
new inner list set # by data.arr = [42, 43], via the @arr.setter
[42, 43]
This only works for one list member though, because __setitem__ and __getitem__ are defined on the class instance itself, not on the list that is a member of the instance. One workaround is sketched below.
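If you need the notification for several list members, one alternative (a minimal sketch, not part of the original answer; the name NotifyingList is made up) is to wrap each stored list in a list subclass that overrides __setitem__:

class NotifyingList(list):
    # a list that announces every item assignment
    def __setitem__(self, key, value):
        print("new value set")
        super().__setitem__(key, value)

class Data:
    def __init__(self):
        self._arr = NotifyingList([0, 1, 2])

    @property
    def arr(self):
        return self._arr

    @arr.setter
    def arr(self, value):
        print("new inner list set")
        self._arr = NotifyingList(value)  # re-wrap on whole-list assignment

data = Data()
data.arr[2] = 5   # prints "new value set"
print(data.arr)   # [0, 1, 5]

Because the getter returns the wrapper, indexing data.arr now goes through the overridden __setitem__, and the same approach works for any number of list attributes.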
start = [2020, 0, 0, 2020]
jaunts = [[2020, 0, 0, 2021], [2021, 0, 0, 2022], [2022, 0, 0, 2023], [2020, 1, 1, 2023], [2021, 0, 0, 2023]]

def gridneighbors(start, jaunts):
    neigh = []
    for o in jaunts:
        new_cell = o
        if start[0] == o[0] and (start[1] == o[1] and start[2] == o[2]):
            new_cell[0] = o[3]
            neigh.append(o)
        elif start[3] == o[3] and (start[1] == o[1] and start[2] == o[2]):
            o[3] = o[0]
            neigh.append(o)
    print(jaunts)
    return neigh

print(gridneighbors(start, jaunts))
Output:
[[2021, 0, 0, 2021], [2021, 0, 0, 2022], [2022, 0, 0, 2023], [2020, 1, 1, 2023], [2021, 0, 0, 2023]]
This is the value of jaunts I'm getting; the first value has changed even though I never updated it.
When you assign a variable that holds a list to another variable, you are creating a new reference to the same list. Changing something through the second variable therefore also changes the first, because both names refer to the same list. For example:
first_list = [2020, 0,0, 2021]
second_list = first_list
second_list[0] = first_list[3]
print(first_list)
Output:
[2021, 0, 0, 2021]
The same thing is happening in the first iteration of your for loop: new_cell and o are both references to the same list, jaunts[0]. So when you write new_cell[0] = o[3] you are really doing jaunts[0][0] = jaunts[0][3].
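One way to fix the function (a sketch, assuming the intent was to append modified copies while leaving jaunts untouched) is to copy each inner list before changing it:

def gridneighbors(start, jaunts):
    neigh = []
    for o in jaunts:
        new_cell = list(o)  # shallow copy: mutating new_cell leaves jaunts intact
        if start[0] == o[0] and start[1] == o[1] and start[2] == o[2]:
            new_cell[0] = o[3]
            neigh.append(new_cell)
        elif start[3] == o[3] and start[1] == o[1] and start[2] == o[2]:
            new_cell[3] = o[0]
            neigh.append(new_cell)
    return neigh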
I need to make a function that compares each value in a list to a threshold and then sets each value accordingly. Code follows:
actions = [0, 0, 0, 0.5, 0, 0.3, 0.8, 0, 0.00000000156]

def treshold(element, value):
    if element >= value:
        element == 1
    else:
        element == 0

treshold(actions, 0.5)
This code however results in the following error:
TypeError: '>=' not supported between instances of 'list' and 'float'
I understand what this error says; however, I do not know how to fix it.
A compact way of doing this, as pointed out by user202729, is with a list comprehension. The key point is that you need to apply the comparison to each entry of the list, not to the list itself. If you want to operate on the whole list at once, you could consider using numpy (a sketch follows the two functions below).
actions = [0, 0, 0, 0.5, 0, 0.3, 0.8, 0, 0.00000000156]

def treshold(element_list, value):
    # int() turns the boolean comparison into 1 or 0
    thresholded_list = [int(a >= value) for a in element_list]
    return thresholded_list
This function is essentially a shorthand for:
def treshold_long(element_list, value):
    thresholded_list = []
    for element in element_list:
        if element >= value:
            thresholded_list.append(1)
        else:
            thresholded_list.append(0)
    return thresholded_list
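Assuming you do reach for numpy as suggested above, a vectorized sketch of the same thresholding could look like this:

import numpy as np

actions = [0, 0, 0, 0.5, 0, 0.3, 0.8, 0, 0.00000000156]
# the comparison is applied element-wise; astype(int) maps True/False to 1/0
thresholded = (np.array(actions) >= 0.5).astype(int)
print(thresholded)  # [0 0 0 1 0 0 1 0 0]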
Thanks to user202729 I have discovered list comprehensions.
actions = [0, 0, 0, 0.5, 0, 0.3, 0.8, 0, 0.00000000156]
treshold = 0.5
actions = [1 if i>=treshold else 0 for i in actions]
print(actions)
This basically solves my problem. I also thank user3235916 for a valid function.
Help please! I was trying to create a column 'Segment' based on these conditions:
if 'Pro_vol' > 1 and 'Cost' >= 43, then append 1
if 'Pro_vol' == 1 and 'Cost' >= 33, then append 1
otherwise append 0
Below is the code for data:
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Pro_vol': [1, 2, 3, 1, 5, 1, 2, 1, 4, 5],
                   'Cost': [12.34, 13.55, 34.00, 19.15, 13.22, 22.34, 33.55, 44.00, 29.15, 53.22]})
I tried this code:
Segment = []
for i in df['Pro_vol']:
    if i > 1:
        Segment.append(1)
    for j in df['Cost']:
        if j >= 43:
            Segment.append(1)
        elif i == 1:
            Segment.append(1)
        elif j >= 33:
            Segment.append(1)
        else:
            Segment.append(0)
df['Segment'] = Segment
And it was giving me an error:
ValueError: Length of values does not match length of index
I don't know any other way to find an answer!
You may consider np.where. (Your loop fails because the nested for over df['Cost'] appends one value per Cost row on every pass of the outer loop, so Segment grows far beyond the ten rows of the frame, which is exactly the length mismatch the ValueError reports.)
np.where(((df.Cost >= 33) & (df.Pro_vol == 1)) | ((df.Cost >= 43) & (df.Pro_vol > 1)), 1, 0)
Out[538]: array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
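For completeness, a runnable sketch (the imports and the assignment back into the frame are my additions, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Pro_vol': [1, 2, 3, 1, 5, 1, 2, 1, 4, 5],
                   'Cost': [12.34, 13.55, 34.00, 19.15, 13.22,
                            22.34, 33.55, 44.00, 29.15, 53.22]})

# one vectorized pass over the frame instead of nested Python loops
df['Segment'] = np.where(((df.Cost >= 33) & (df.Pro_vol == 1)) |
                         ((df.Cost >= 43) & (df.Pro_vol > 1)), 1, 0)
print(df['Segment'].tolist())  # [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]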
I have a Spark application which contains the following segment:
val repartitioned = rdd.repartition(16)
val filtered: RDD[(MyKey, myData)] = MyUtils.filter(repartitioned, startDate, endDate)
val mapped: RDD[(DateTime, myData)] = filtered.map(kv => (kv._1.processingTime, kv._2))
val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_ + _)
When I run this with some logging this is what I see:
repartitioned ======> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)]
filtered ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
mapped ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
reduced ======> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)]
My logging is done using these two lines:
val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)
log.info(s"rdd ======> [${sizes.collect().toList}]")
My question is: why does my data end up in a single partition after the reduceByKey? After the filter the data is evenly distributed, but the reduceByKey puts all of it into one partition.
I am guessing all your processing times are the same. Alternatively, their hashCode values (from the DateTime class) are the same. Is that a custom class?
I will answer my own question, since I figured it out. My DateTimes were all without seconds and milliseconds, since I wanted to group data belonging to the same minute. As a result, the hashCode() values of Joda DateTimes that are one minute apart differ by a constant, 60000:
scala> val now = DateTime.now
now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z
scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode
res42: Int = 60000
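Why this matters: Spark's default HashPartitioner assigns a key to partition hashCode % numPartitions, and 60000 is divisible by 16, so keys one minute apart always land in the same partition. A few lines of Python (the base value is an arbitrary stand-in for one hashCode) make the arithmetic concrete:

num_partitions = 16
base = 1234567                                 # arbitrary stand-in hashCode
hashes = [base + i * 60000 for i in range(5)]  # minute-apart DateTime hashes
print([h % num_partitions for h in hashes])    # [7, 7, 7, 7, 7] -- one partition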
As the following example shows, when the hashCode values are evenly spaced, they can all end up in the same partition:
scala> val nums = for(i <- 0 to 1000000) yield ((i*20 % 1000), i)
nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
scala> val rddNum = sc.parallelize(nums)
rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val reducedNum = rddNum.reduceByKey(_+_)
reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at reduceByKey at <console>:25
scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, true).collect.toList
res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
To distribute my data more evenly across the partitions, I created my own custom Partitioner:
class JodaPartitioner(rddNumPartitions: Int) extends Partitioner {
  def numPartitions: Int = rddNumPartitions

  def getPartition(key: Any): Int = {
    key match {
      case dateTime: DateTime =>
        val sum = dateTime.getYear + dateTime.getMonthOfYear + dateTime.getDayOfMonth +
          dateTime.getMinuteOfDay + dateTime.getSecondOfDay
        sum % numPartitions
      case _ => 0
    }
  }
}
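The partitioner can then be handed to reduceByKey via the overload that takes a Partitioner, e.g. mapped.reduceByKey(new JodaPartitioner(16), _ + _). For what it's worth, the same pitfall and fix can be sketched from PySpark, where reduceByKey accepts a custom partitionFunc (the spread function below is just one assumed way to break up evenly spaced keys):

from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# keys spaced 20 apart, as in the Scala REPL session above
nums = [(i * 20 % 1000, i) for i in range(1000001)]
rddNum = sc.parallelize(nums)

# with the default hash (hash(int) == int) every key is a multiple of 20,
# so key % 20 == 0 and all 50 distinct keys land in partition 0;
# dividing first spreads them across the partitions again
def spread(key):
    return key // 20

reduced = rddNum.reduceByKey(add, numPartitions=20, partitionFunc=spread)
print(reduced.mapPartitions(lambda it: [sum(1 for _ in it)]).collect())
# roughly even: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]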