I've seen similar questions and answers on SO, but I'm struggling to understand how to apply them.
I am trying to port the following Python 2.x code to Python 3.x:
deals = sorted([DealData(deal) for deal in deals],
               lambda f1, f2: f1.json_data['time'] > f2.json_data['time'])
I've seen suggestions to use the cmp_to_key function, but I can't get it working. What am I missing?
This is my attempt with cmp_to_key:
deals = sorted(DealData, key=functools.cmp_to_key(cmp=compare_timestamps))
def compare_timestamps(x,y):
    return x.json_data['timeStamp'] > y.json_data['timeStamp']
I receive the following error: cmp_to_key() missing required argument 'mycmp' (pos 1)
In Python 3, sorted no longer accepts a comparison function. Instead, you pass a key function that returns the value to sort by:
deals = sorted(
    [DealData(deal) for deal in deals],
    key=lambda deal_data: deal_data.json_data["time"]
)
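cmp_to_key is only needed if you already have a comparison function. Pass that function positionally (the parameter is named mycmp, not cmp, which is what the error is complaining about), and make sure it returns a negative, zero, or positive number rather than a bool, i.e.: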
from functools import cmp_to_key

def compare_deals(d1, d2):
    if d1.json_data["time"] > d2.json_data["time"]:
        return 1
    if d1.json_data["time"] < d2.json_data["time"]:
        return -1
    # equal
    return 0

deals = sorted(
    [DealData(deal) for deal in deals],
    key=cmp_to_key(compare_deals)
)
The Sorting HOW TO in the Python documentation gives more examples.
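To tie the two approaches together, here is a minimal self-contained sketch. The DealData class below is a hypothetical stand-in for the one in the question (only its json_data attribute is assumed), and the sample timestamps are made up for illustration.

```python
from functools import cmp_to_key


class DealData:
    """Hypothetical stand-in: wraps a dict and exposes it as json_data."""

    def __init__(self, deal):
        self.json_data = deal


def compare_deals(d1, d2):
    # A comparison function must return a negative, zero, or positive number.
    t1, t2 = d1.json_data["time"], d2.json_data["time"]
    return (t1 > t2) - (t1 < t2)


raw_deals = [{"time": 30}, {"time": 10}, {"time": 20}]  # made-up sample data

# Preferred: a key function extracts the value to sort by.
by_key = sorted((DealData(d) for d in raw_deals),
                key=lambda deal_data: deal_data.json_data["time"])

# Equivalent, reusing an old-style comparison function (passed positionally).
by_cmp = sorted((DealData(d) for d in raw_deals),
                key=cmp_to_key(compare_deals))

times_by_key = [d.json_data["time"] for d in by_key]
times_by_cmp = [d.json_data["time"] for d in by_cmp]
print(times_by_key, times_by_cmp)
```

Both calls should produce the same ordering. The key version is usually the simpler choice when the sort value can be read straight off the object.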
I'm trying to find the strings in two lists that almost match. Suppose there are two lists as below:
string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']
string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']
Expected output:
Similar = ['apple_from_2018','samsung_from_2017','htc_from_2015','lenovo_decommision_2017']
Not Similar = ['nokia_from_2010','moto_from_2019']
I tried the implementation below, but it does not give the expected result:
similar = []
not_similar = []
for item1 in string_list_1:
    for item2 in string_list_2:
        if SequenceMatcher(a=item1,b=item2).ratio() > 0.90:
            similar.append(item1)
        else:
            not_similar.append(item1)
When I run the implementation above, the result is not as expected. I would appreciate it if someone could identify what is missing so that I get the required result.
You can use the following function to measure the similarity between two strings:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar("apple_from_2018", "apple_from_2020"))
Output :
0.8666666666666667
Using this function, you can select the strings whose similarity ratio exceeds a threshold. Note that you need to lower your threshold from 0.90 to around 0.85 or less to get the expected output, since this pair scores only about 0.87.
The following code should work for you:
string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']
string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']

from difflib import SequenceMatcher

similar = []
not_similar = []
for item1 in string_list_1:
    # Set the state as false
    found = False
    for item2 in string_list_2:
        if SequenceMatcher(None, a=item1,b=item2).ratio() > 0.80:
            similar.append(item1)
            found = True
            break
    if not found:
        not_similar.append(item1)

print("Similar : ", similar)
print("Not Similar : ", not_similar)
Output :
Similar : ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015', 'lenovo_decommision_2017']
Not Similar : ['nokia_from_2010', 'moto_from_2019']
Breaking out of the inner loop at the first match saves time and avoids redundant appends: each item lands in exactly one list, exactly once. I have also lowered the similarity threshold to 0.80, since 0.90 was too high for this data. Feel free to tweak the value.
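As a possible alternative, the standard library's difflib.get_close_matches can replace the inner loop. The sketch below reuses the lists from the question; the 0.80 cutoff is the same assumption as above, and note that get_close_matches keeps candidates scoring at or above the cutoff, whereas the loop above uses a strict comparison.

```python
from difflib import get_close_matches

string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']

CUTOFF = 0.80  # assumed threshold; tune it for your own data

similar, not_similar = [], []
for item in string_list_1:
    # n=1: we only need to know whether at least one candidate is close enough.
    if get_close_matches(item, string_list_2, n=1, cutoff=CUTOFF):
        similar.append(item)
    else:
        not_similar.append(item)

print("Similar : ", similar)
print("Not Similar : ", not_similar)
```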
I have this code:
def main():
    if (len(sys.argv) > 2) :
        P=list()
        f= open('Trace.txt' , 'w+')
        Seed = int(sys.argv[1])
        for i in range(2, len(sys.argv)):
            P[i-2] = int(sys.argv[i])
        for j in range(0, len(sys.argv)-1) :
            Probability=P[j]
            for Iteration in (K*j, K*(j+1)):
                Instruction= generateInstruction(Seed, Probability)
                f.write(Instruction)
        f.close()
    else:
        print('Params Error')

if __name__ == "__main__":
    main()
The idea is that I pass some parameters on the command line. The first is the seed; the rest I want to store in a list that I parse later, processing each parameter accordingly.
I keep receiving this error:
P[i-2] = int(sys.argv[i])
IndexError: list assignment index out of range
What am I doing wrong?
PS: K and generateInstruction() are defined in a previous part of the code.
The error you see is caused by indexing a list with an invalid index.
Specifically, P is still an empty list when that line runs, so P[0] does not exist and cannot be assigned to. What you probably want is to add the element to the list, which can be achieved, for example, by replacing:
P[i-2] = int(sys.argv[i])
with:
P.append(int(sys.argv[i]))
Note also that command-line arguments are usually handled more conveniently with the standard argparse module than by parsing sys.argv manually.
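A minimal sketch of what the argparse version might look like. The argument names seed and probabilities, the description text, and the sample values are assumptions made for illustration, not taken from the original script.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate a trace file.")
    parser.add_argument("seed", type=int, help="seed for the generator")
    # nargs="+" collects one or more remaining values into a list.
    parser.add_argument("probabilities", type=int, nargs="+",
                        help="one or more probability values")
    return parser.parse_args(argv)


# Simulated command line, equivalent to: script.py 42 10 20 30
args = parse_args(["42", "10", "20", "30"])
print(args.seed, args.probabilities)
```

Calling parse_args() with no argument reads sys.argv[1:], and argparse prints a usage message and exits if the arguments are missing or not integers.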
It looks like you might be referencing a list item that does not exist.
I haven't used Python in quite a while but I'm pretty sure that if you want to add a value to the end of a list you can use someList.append(foo)
The problem is that you are assigning a value to an index which does not yet exist.
You need to replace
P[i-2] = int(sys.argv[i])
with
P.append(int(sys.argv[i]))
Furthermore, range(2, len(sys.argv)) already stops at len(sys.argv)-1, so this loop is fine as written:
for i in range(2, len(sys.argv)):
The loop that overruns the list is the second one. P holds len(sys.argv)-2 values, so
for j in range(0, len(sys.argv)-1):
should become
for j in range(len(P)):
as you will run into a list index out of range error otherwise.
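The index arithmetic can also be avoided altogether by slicing the argument list and iterating over the result directly. This is a sketch only: parse_params is a hypothetical helper name, and the argv list below is a simulated stand-in for sys.argv.

```python
def parse_params(argv):
    """Return (seed, probabilities) from an argv-style list."""
    if len(argv) <= 2:
        raise ValueError("expected a seed followed by at least one probability")
    seed = int(argv[1])
    # argv[2:] is every remaining argument, so no index bookkeeping is needed.
    probabilities = [int(arg) for arg in argv[2:]]
    return seed, probabilities


# Simulated sys.argv; "script.py" is a placeholder program name.
seed, probabilities = parse_params(["script.py", "7", "10", "20"])

# Iterate over the list itself (with enumerate if the position is needed).
for j, probability in enumerate(probabilities):
    print(j, probability)
```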
I'm using a map function to generate a new column whose value depends on the value of a column that already exists in the dataframe.
def computeTechFields(row):
    if row.col1!=VALUE_TO_COMPARE:
        tech1=0
    else:
        tech1=1
    return (row.col1, row.col2, row.col3, tech1)

delta2rdd = delta.map(computeTechFields)
The problem is that my main dataframe has more than 150 columns that I have to return with the map function so in the end I have something like this :
return (row.col1, row.col2, row.col3, row.col4, row.col5, row.col6, row.col7, row.col8, row.col9, row.col10, row.col11, row.col12, row.col13, row.col14, row.col15, row.col16, row.col17, row.col18 ..... row.col149, row.col150, row.col151, tech1)
As you can see, it is really long to write and difficult to read. So I tried to do something like this :
return (row.*, tech1)
But of course it did not work.
I know that the "withColumn" function exists but I don't know much about its performance and could not make it work anyway.
Edit (What happened with the withColumn function) :
def computeTech1(row):
    if row.col1!=VALUE_TO_COMPARE:
        tech1=0
    else:
        tech1=1
    return tech1

delta2 = delta.withColumn("tech1", computeTech1)
And it gave me this error :
AssertionError: col should be Column
I tried to do something like this :
return col(tech1)
The error was the same
I also tried :
delta2 = delta.withColumn("tech1", col(computeTech1))
This time, the error was :
AttributeError: 'function' object has no attribute '_get_object_id'
End of the edit
So my question is, how can I return all the columns + a few more within my UDF used by the map function ?
Thanks !
I'm not super firm with Python, so people might correct me on the syntax here, but the general idea is to make your function a UDF that takes a column as input, then call that inside withColumn. I used a lambda here, but with some fiddling it should also work with a regular function.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

computeTech1UDF = udf(
    lambda value: 0 if value != VALUE_TO_COMPARE else 1, IntegerType())
delta2 = delta.withColumn("tech1", computeTech1UDF(delta.col1))
What you tried did not work since you did not provide withColumn with a column expression (see http://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn). Using the UDF wrapper achieves exactly that.
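If you do stay with the map approach from the question, the long return statement can be avoided, because a Row behaves like a tuple. The sketch below uses a namedtuple as a stand-in for pyspark.sql.Row (which is likewise a tuple subclass) so that it runs without Spark. The three-column layout and the placeholder value of VALUE_TO_COMPARE are assumptions.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which is likewise a tuple subclass.
Row = namedtuple("Row", ["col1", "col2", "col3"])

VALUE_TO_COMPARE = "x"  # placeholder value for this sketch


def computeTechFields(row):
    tech1 = 1 if row.col1 == VALUE_TO_COMPARE else 0
    # Copy every existing field, then append the new one.
    return tuple(row) + (tech1,)


print(computeTechFields(Row("x", 2, 3)))
print(computeTechFields(Row("y", 5, 6)))
```

The same function body should work unchanged however many columns the row has, since it never names them individually.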
I am new to Python and am trying to port a script from 2.x to 3.x. I am encountering the error "TypeError: must use keyword argument for key function" in Python 3.x. Below is the piece of code. Please help.
def resort_working_array( self, chosen_values_arr, num ):
    for item in self.__working_arr[num]:
        data_node = self.__pairs.get_node_info( item )
        new_combs = []
        for i in range(0, self.__n):
            # numbers of new combinations to be created if this item is appended to array
            new_combs.append( set([pairs_storage.key(z) for z in xuniqueCombinations( chosen_values_arr+[item], i+1)]) - self.__pairs.get_combs()[i] )
        # weighting the node
        item.weights = [ -len(new_combs[-1]) ] # node that creates most of new pairs is the best
        item.weights += [ len(data_node.out) ] # less used outbound connections most likely to produce more new pairs while search continues
        item.weights += [ len(x) for x in reversed(new_combs[:-1])]
        item.weights += [ -data_node.counter ] # less used node is better
        item.weights += [ -len(data_node.in_) ] # otherwise we will prefer node with most of free inbound connections; somehow it works out better ;)
    self.__working_arr[num].sort( key = lambda a,b: cmp(a.weights, b.weights) )
Looks like the problem is in this line.
self.__working_arr[num].sort( key = lambda a,b: cmp(a.weights, b.weights) )
The key callable should take only one argument. Try:
self.__working_arr[num].sort(key = lambda a: a.weights)
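This works because Python compares lists element by element, so using the weights list as the key gives the same ordering that cmp(a.weights, b.weights) produced in Python 2. A small sketch with made-up nodes (SimpleNamespace objects standing in for the real node class, with invented names and weights):

```python
from types import SimpleNamespace

# Hypothetical nodes; only the weights attribute matters here.
nodes = [
    SimpleNamespace(name="a", weights=[-1, 3, 0]),
    SimpleNamespace(name="b", weights=[-2, 5, 1]),
    SimpleNamespace(name="c", weights=[-1, 2, 9]),
]

# Lists compare element by element: first items decide the order,
# later items only break ties.
nodes.sort(key=lambda node: node.weights)

order = [node.name for node in nodes]
print(order)
```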
The exact same error message appears if you pass the key function as a positional argument instead of a keyword argument.
Wrong:
sorted(lst, myKeyFunction)
Correct:
sorted(lst, key=myKeyFunction)
Python 3.6.6
Following on from the answer by @Kevin - and more specifically the comment/question by @featuresky:
Using functools.cmp_to_key and reimplementing cmp (as noted in the porting guide) I have a hacky workaround for a scenario where 2 elements can be compared via lambda form. To use the OP as an example; instead of:
self.__working_arr[num].sort( key = lambda a,b: cmp(a.weights, b.weights) )
You can use this:
from functools import cmp_to_key
[...]
def cmp(x, y): return (x > y) - (x < y)
self.__working_arr[num].sort(key=cmp_to_key(lambda a,b: cmp(a.weights, b.weights)))
Admittedly, I'm somewhat new to Python myself and don't really have a good handle on Python 2. I'm sure the code could be rewritten in a much better/cleaner way and I'd certainly love to hear a "proper" way to do this.
OTOH in my case this was a handy hack for an old Python 2 script (updated to Python 3) that I don't have time/energy to "properly" understand and rewrite right now.
Beyond the fact that it works, I would certainly not recommend wide usage of this hack! But I figured that it was worth sharing.
def shufflemode():
    import random
    combined = zip(question, answer)
    random.shuffle(combined)
    question[:], answer[:] = zip(*combined)
but then I get the error:
TypeError: object of type 'zip' has no len()
What do I do? I'm so confused.
I wonder the same thing. According to:
randomizing two lists and maintaining order in python
You should be able to do it the way the OP tried, but I also get the same error. I think the answers in the link are using Python 2 and not 3; could this be the problem?
This is an issue between Python 2 and Python 3. In Python 2 using shuffle after zip works, because zip returns a list. In Python 3: "TypeError: object of type 'zip' has no len()" because zip returns an iterator in Python 3.
Solution, use list() to convert to a list:
combined = list(zip(question, answer))
random.shuffle(combined)
The error appeared with shuffle() because shuffle() uses len().
References issue:
The zip() function in python 3
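Putting the fix together, a complete Python 3 version of the function from the question might look like the sketch below. The sample lists are taken from the other answer in this thread.

```python
import random

question = ["q1", "q2", "q3", "q4", "q5"]
answer = ["a1", "a2", "a3", "a4", "a5"]


def shufflemode():
    # list() materializes the zip iterator so shuffle() can call len() on it.
    combined = list(zip(question, answer))
    random.shuffle(combined)
    # zip(*combined) yields two tuples; slice assignment updates both lists in place.
    question[:], answer[:] = zip(*combined)


shufflemode()
print(question, answer)
```

The order differs from run to run, but each question stays at the same index as its answer.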
Stumbled upon this and was surprised to learn about the random.shuffle method. So I tried your example, and it worked for me in Python 2.7.5:
def shufflemode():
    import random
    combined = zip(question, answer)
    random.shuffle(combined)
    question[:], answer[:] = zip(*combined)

question = ["q1","q2","q3","q4","q5"]
answer = ["a1","a2","a3","a4","a5"]

if __name__ == "__main__":
    shufflemode()
    print question,answer
The result is two lists with the same randomized sequence of questions and answers:
>>>
['q3', 'q2', 'q5', 'q4', 'q1'] ['a3', 'a2', 'a5', 'a4', 'a1']
>>>