Python For loop taking too much time - python-3.x

I have two datasets, one with 650k records and the other with 20k records, and I want to find matching or approximately matching values in a single column of each. How can I speed this up? My Python loop is very slow.
Note: Both columns contain strings.
Here is my simple code:
from fuzzywuzzy import fuzz
df1
df2
for i in df1['string1']:
    for j in df2['string2']:
        Ratio = fuzz.ratio(i.lower(),j.lower())
        print(Ratio)

You might want to replace your usage of FuzzyWuzzy with RapidFuzz (I am the author), which is significantly faster.
Using RapidFuzz your algorithm can be implemented in the following way:
import pandas as pd
from rapidfuzz import fuzz, process
def lower(s, **kwargs):
    # preprocess each string before scoring
    return s.lower()
d = {'string1': ["abcd", "abcde"]*1000, 'string2': ["abcd", "abc"]*1000}
df = pd.DataFrame(data=d)
process.cdist(df["string1"], df["string2"],
              processor=lower, scorer=fuzz.ratio)
which returns
array([[100,  86, 100, ...,  86, 100,  86],
       [ 89,  75,  89, ...,  75,  89,  75],
       [100,  86, 100, ...,  86, 100,  86],
       ...,
       [ 89,  75,  89, ...,  75,  89,  75],
       [100,  86, 100, ...,  86, 100,  86],
       [ 89,  75,  89, ...,  75,  89,  75]], dtype=uint8)
You can further improve the performance using multithreading:
process.cdist(df["string1"], df["string2"],
              processor=lower, scorer=fuzz.ratio, workers=-1)
If you do not want to use all available cores, you can specify a different count for workers.
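One thing worth checking before running cdist on the full data is the size of the result, since it holds one score per pair of strings. The following is only a back-of-the-envelope sketch: it assumes one byte per score (as in the uint8 output above), and chunk_rows is a hypothetical value chosen for illustration, not a RapidFuzz setting.

```python
rows, cols = 650_000, 20_000   # record counts from the question
bytes_per_score = 1            # assumption: uint8 scores, as in the output above

# Size of the full score matrix (one score per pair of strings)
full_bytes = rows * cols * bytes_per_score
print(f"full matrix: {full_bytes / 2**30:.1f} GiB")

# Hypothetical chunking: score 10,000 rows of the large column at a time
chunk_rows = 10_000
chunk_bytes = chunk_rows * cols * bytes_per_score
n_chunks = -(-rows // chunk_rows)  # ceiling division
print(f"{n_chunks} chunks of {chunk_bytes / 2**20:.1f} MiB each")
```

If the full matrix does not fit in memory, one option is to call process.cdist on slices of df1['string1'] of that size and keep only the best matches from each slice.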

Related

Taking a 3*3 subset matrix from a really large numpy ndarray in Python

I am trying to take a 3*3 subset from a really large 400 x 500 numpy ndarray, but for some reason I am not getting the desired result. Instead, it returns the first three rows in full.
Here is the code that I wrote.
subset_matrix = mat[0:3][0:3]
But this is the output I get in my Jupyter Notebook:
array([[91, 88, 87, ..., 66, 75, 82],
[91, 89, 88, ..., 68, 78, 84],
[91, 89, 89, ..., 72, 80, 87]], dtype=uint8)
mat[0:3][0:3] slices axis 0 of the 2D array twice, so it is equivalent to mat[0:3]. What you need is mat[0:3, 0:3], which slices rows and columns in a single indexing operation.
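A small sketch of the difference, using a stand-in array of the same shape as the one in the question (the contents are made up for illustration):

```python
import numpy as np

# Stand-in for the real data: a 400 x 500 array of consecutive integers
mat = np.arange(400 * 500).reshape(400, 500)

chained = mat[0:3][0:3]  # the second slice acts on rows again
block = mat[0:3, 0:3]    # rows 0-2 and columns 0-2

print(chained.shape)  # (3, 500)
print(block.shape)    # (3, 3)
```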

Converting nested List to Dictionary | Python

I have made a list comprehension that generates mock fingerprint data.
import random
val = [[hand, [digit, [[random.randint(1, 250) for i in range(0, 8)] for j in range(0, 4)]]] for hand in ['Left', 'Right'] for digit in ['Thumb', 'Index', 'Middle', 'Ring', 'Little']]
val
>>> [['Left',
['Thumb',
[[247, 115, 74, 161, 47, 31, 231, 34],
[246, 52, 1, 160, 196, 65, 4, 118],
[128, 219, 128, 140, 207, 2, 156, 226],
[127, 61, 56, 151, 169, 122, 117, 105]]]]
...
['Right',
['Thumb',
[[229, 222, 138, 230, 86, 119, 201, 209],
[106, 238, 191, 15, 214, 134, 77, 145],
[186, 174, 81, 143, 138, 5, 54, 148],
[176, 85, 205, 235, 228, 204, 91, 17]]]]
Note: "digit" means finger name.
Based on this list output, I would like to convert it to a dictionary of the below structure.
I've been unable to make a dictionary comprehension that would yield the desired output:
{
"Left":
{
"Thumb": [...],
"Index": [...],
"Middle": [...],
"Ring": [...],
"Little": [...]
},
"Right":
{
"Thumb": [...],
"Index": [...],
"Middle": [...],
"Ring": [...],
"Little": [...]
}
}
Note: I understand my above dictionary "output" is somewhat incorrect.
I'm using Jupyter Notebooks. Best attempt:
output = {hand: {digit: matrix for i in val for hand in val[i] for digit in hand}}
output
>>> ---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-46-500095a8cf85> in <module>
----> 1 output = {hand: {digit: matrix for i in val for hand in val[i] for digit in hand}}
2 output
NameError: name 'hand' is not defined
You can just create the structure directly with a dict comprehension.
import random
output = {
    hand: {
        digit: [
            random.randint(1, 250) for _ in range(0, 8)
        ] for digit in ['Thumb', 'Index', 'Middle', 'Ring', 'Little']
    } for hand in ['Left', 'Right']
}
print(output)
The trick to nested comprehensions is to create your inner-most structure first and work outwards from there.
So create your inner list,
inner = [random.randint(1, 250) for _ in range(0, 8)]
Then your dict of digits.
digits = {digit: inner for digit in ['Thumb', 'Index', 'Middle', 'Ring', 'Little']}
And lastly your dict of hands.
hands = {hand: digits for hand in ['Left', 'Right']}
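Combine those all together, inlining each piece so that every digit gets its own freshly generated list rather than sharing one inner list, and you get:
{
    hand: {
        digit: [
            random.randint(1, 250) for _ in range(0, 8)
        ] for digit in ['Thumb', 'Index', 'Middle', 'Ring', 'Little']
    } for hand in ['Left', 'Right']
}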
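If the nested list val from the question already exists and should be converted rather than regenerated, a plain loop is one way to do it. This is a minimal sketch, assuming each entry has the shape [hand, [digit, matrix]] shown in the question's output:

```python
import random

hands = ['Left', 'Right']
digits = ['Thumb', 'Index', 'Middle', 'Ring', 'Little']

# Same structure as the question's list: [hand, [digit, 4 x 8 matrix]]
val = [[hand, [digit, [[random.randint(1, 250) for _ in range(8)] for _ in range(4)]]]
       for hand in hands for digit in digits]

output = {}
for hand, (digit, matrix) in val:
    # Create the per-hand dict on first sight, then store the digit's matrix
    output.setdefault(hand, {})[digit] = matrix

print(list(output))          # ['Left', 'Right']
print(list(output['Left']))  # ['Thumb', 'Index', 'Middle', 'Ring', 'Little']
```

A loop with setdefault is used here instead of a comprehension because each hand appears in several entries of val, and the entries for one hand have to be grouped into the same inner dict.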

Merge two keys of a single dictionary in python

I have a dictionary "a" with the keys "x", "y" and "z", each containing integer values.
What is the most efficient way to produce a combined list if I want to merge two keys of the dictionary (assuming the values under each key have the same size and are of integer type)?
For example, x+y and y+z.
Explanation:
Suppose you have to merge two keys into a new list or a new dict without altering the original dictionary.
Example:
a = {"x" : {1,2,3,....,50},
     "y" : {1,2,3,.....50},
     "z" : {1,2,3,.....50}
}
Desired list:
x+y = [2,4,6,8.....,100]
y+z = [2,4,6,......,100]
A very efficient way is to convert the dictionary to a pandas DataFrame and let its vectorized methods do the job for you:
import pandas as pd
a = {"x" : range(1,51), "y" : range(1,51), "z" : range(1,51)}
df = pd.DataFrame(a)
x_plus_y = (df['x'] + df['y']).to_list()
y_plus_z = (df['y'] + df['z']).to_list()
print(x_plus_y)
#[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100]
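For data of this size, the same elementwise sum can also be sketched without pandas by using zip. The helper name add_keys is hypothetical, and the sketch assumes the values are ordered sequences of equal length (ranges or lists, not sets):

```python
a = {"x": range(1, 51), "y": range(1, 51), "z": range(1, 51)}

def add_keys(data, first, second):
    # Pair up the values position by position and add each pair;
    # the original dictionary is not modified
    return [p + q for p, q in zip(data[first], data[second])]

x_plus_y = add_keys(a, "x", "y")
y_plus_z = add_keys(a, "y", "z")
print(x_plus_y[:5])  # [2, 4, 6, 8, 10]
```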
It seems like you're trying to mimic a join-type operation. That is not native to python dicts, so if you really want that type of functionality, I'd recommend looking at the pandas library.
If you just want to merge dict keys without more advanced features, this function should help:
from itertools import chain
from collections import Counter
from typing import Dict, List, Set
def merge_keys(data: Dict[str, Set[int]], *merge_list: str) -> Dict[str, List[int]]:
    merged_data = dict()
    # count how often each value occurs across the selected keys
    merged_counts = Counter(list(chain(*map(lambda k: list(data.get(k, {})) if k in merge_list else [], data))))
    # each value multiplied by its count, e.g. 3 present under both keys gives 6
    merged_data['+'.join(merge_list)] = [k*v for k,v in merged_counts.items()]
    return merged_data
You can run this with merge_keys(a, "x", "y", "z", ...), where a is the name of your dict. You can pass as many keys as you want ("x", "y", "z", ...), since this function takes a variable number of arguments.
If you want two separate merges in the same dict, all you need to do is (the | operator on dicts requires Python 3.9 or later):
b = merge_keys(a, "x", "y") | merge_keys(a, "y", "z")
Note that the order of the keys changes the final merged key ("y+z" vs "z+y") but not the value of their merged sets.
P.S.: This was actually a little tricky because the original dict had set values rather than lists. Sets are unordered, so you can't just add them elementwise. That's why I used Counter here, in case you were wondering.

How to display name and grade for a student in a dictionary who has the highest grade

I'm brand new to programming and I'm stuck on a practice exercise.
EDIT: exact error code "avgDict[k] =max(sum(v)/ float(len(v)))
TypeError: 'float' object is not iterable"
If I remove max it's printing every student's avg.
# student_grades contains scores (out of 100) for 5 assignments
diction = {
    'Andrew': [56, 79, 90, 22, 50],
    'Colin': [88, 62, 68, 75, 78],
    'Alan': [95, 88, 92, 85, 85],
    'Mary': [76, 88, 85, 82, 90],
    'Tricia': [99, 92, 95, 89, 99]
}
def averageGrades(diction):
    avgDict = {}
    for k, v in diction.items():
        avgDict[k] =max(sum(v)/ float(len(v)))
    return avgDict
What your code is doing right now is iterating through each key-value pair of the dictionary (where the key is the student's name and the value is the list of the student's grades), and then calculating the average for the student. Then, the way you are using max right now, it is trying to find the max of a single student's average. This is why you are receiving the error, because max expects either an iterable or multiple parameters, and a float (which is the value produced by sum(v) / float(len(v))) is not an iterable. You should instead compute all of the averages first, and then find the max value in the dictionary of averages:
diction = {
    'Andrew': [56, 79, 90, 22, 50],
    'Colin': [88, 62, 68, 75, 78],
    'Alan': [95, 88, 92, 85, 85],
    'Mary': [76, 88, 85, 82, 90],
    'Tricia': [99, 92, 95, 89, 99]
}
def averageGrades(diction):
    avgDict = {}
    for k, v in diction.items():
        avgDict[k] = sum(v) / len(v)
    return max(avgDict.items(), key=lambda i: i[1]) # find the pair which has the highest value
print(averageGrades(diction)) # ('Tricia', 94.8)
Sidenote, in Python 3, using / does normal division (as opposed to integer division) by default, so casting len(v) to a float is unnecessary.
Alternatively, if you don't need to create the avgDict variable, you can just determine the max directly without the intermediate variable:
def averageGrades(diction):
    return max([(k, sum(v) / len(v)) for k, v in diction.items()], key=lambda i: i[1])
print(averageGrades(diction)) # ('Tricia', 94.8)
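Another way to write the same idea, as a sketch rather than a replacement for the answer above, is to build the averages with statistics.mean and let max pick the key with the largest average. The names averages and best are introduced here for illustration:

```python
from statistics import mean

diction = {
    'Andrew': [56, 79, 90, 22, 50],
    'Colin': [88, 62, 68, 75, 78],
    'Alan': [95, 88, 92, 85, 85],
    'Mary': [76, 88, 85, 82, 90],
    'Tricia': [99, 92, 95, 89, 99]
}

# One average per student
averages = {name: mean(scores) for name, scores in diction.items()}

# max iterates over the keys; key=averages.get ranks them by their average
best = max(averages, key=averages.get)
print(best, averages[best])
```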

ValueError for a matplotlib contour plot in Python

I receive "ValueError: setting an array element with a sequence" when running. I have tried to turn everything into a numpy array to no avail.
import matplotlib
import numpy as np
from matplotlib import pyplot
X=np.array([
    np.array([1,2,3,4,5,6,7]),
    np.array([1,2,3,4,5,6,7]),
    np.array([1,2,3,4,5,6,6.5,7.5]),
    np.array([1,2,3,4,5,6,7,8]),
    np.array([1,2,3,4,5,6,7,8,8.5]),
    np.array([1,2,3,4,5,6,7,8]),
    np.array([1,2,3,4,5,6,7])])
Y=np.array([
    np.array([1,1,1,1,1,1,1]),
    np.array([2,2,2,2,2,2,2]),
    np.array([3,3,3,3,3,3,2.5,3]),
    np.array([4,4,4,4,4,4,4,4]),
    np.array([5,5,5,5,5,5,5,5,5]),
    np.array([6,6,6,6,6,6,6,6]),
    np.array([7,7,7,7,7,7,7])])
Z= np.array([
    np.array([4190, 4290, 4200, 4095, 4181, 4965, 4995]),
    np.array([4321, 4389, 4311, 4212, 4894, 4999, 5001]),
    np.array([4412, 4442, 4389, 4693, 4899, 5010, 5008, 4921]),
    np.array([4552, 4651, 4900, 4921, 4932, 5020, 4935, 4735]),
    np.array([4791, 4941, 4925, 5000, 4890, 4925, 4882, 4764, 4850]),
    np.array([4732, 4795, 4791, 4852, 4911, 4865, 4919, 4862]),
    np.array([4520, 4662, 4735,4794,4836,4852,4790])])
matplotlib.pyplot.contour(X, Y, Z)
EDIT
I sort of solved this problem by removing values from my sub-arrays to make their lengths equal. However, I would still like to know how to feed an array containing sub-arrays of different lengths into a contour plot.
The answer is to make X, Y and Z inputs all 1D arrays and to use tricontour instead of contour.
import matplotlib.pyplot
import numpy as np
X=np.array([1,2,3,4,5,6,7,
            1,2,3,4,5,6,7,
            1,2,3,4,5,6,6.5,7.5,
            1,2,3,4,5,6,7,8,
            1,2,3,4,5,6,7,8,9,
            1,2,3,4,5,6,7,8,
            1,2,3,4,5,6,7])
Y=np.array([1,1,1,1,1,1,1,
            2,2,2,2,2,2,2,
            3,3,3,3,3,3,2.5,3,
            4,4,4,4,4,4,4,4,
            5,5,5,5,5,5,5,5,5,
            6,6,6,6,6,6,6,6,
            7,7,7,7,7,7,7])
Z= np.array([80, 73, 65, 57, 61, 55, 60,
             78, 73, 71, 55, 55, 60, 90,
             65, 62, 61, 61, 51, 60, 71, 78,
             70, 58, 58, 65, 80, 81, 90, 81,
             80, 59, 51, 58, 70, 70, 90, 89, 78,
             90, 63, 55, 58, 65, 78, 79, 70,
             100, 68, 54,52,60,72,71])
Y=np.flip(Y,0)  # reverse Y so the first row of data is plotted at the top
contours=matplotlib.pyplot.tricontour(X, Y, Z,11)  # 11 contour levels
matplotlib.pyplot.xlim([1,8])
matplotlib.pyplot.ylim([1,7])
matplotlib.pyplot.clabel(contours, fontsize=6, inline=False)
matplotlib.pyplot.show()
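Rather than typing the flattened arrays by hand, rows of different lengths can be joined with np.concatenate. This is a minimal sketch with small made-up rows (the names rows_x, rows_y and rows_z and their values are illustrative, not the data from the question):

```python
import numpy as np

# Ragged rows kept as plain Python lists of arrays, one entry per row
rows_x = [np.array([1, 2, 3]), np.array([1, 2, 3, 4])]
rows_y = [np.array([1, 1, 1]), np.array([2, 2, 2, 2])]
rows_z = [np.array([10, 20, 30]), np.array([15, 25, 35, 45])]

# Join the rows end to end into the 1D arrays that tricontour expects
X = np.concatenate(rows_x)
Y = np.concatenate(rows_y)
Z = np.concatenate(rows_z)

print(X.shape, Y.shape, Z.shape)  # (7,) (7,) (7,)
```

The resulting X, Y and Z have one entry per data point and can be passed to tricontour in the same way as the hand-written arrays above.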
