pandas fuzzy match on the same column but prevent matching against itself - python-3.x

This is a common question but I have an extra condition: how do I remove matches based on a unique ID? Or, how to prevent matching against itself?
Given a dataframe:
df = pd.DataFrame({'id': [1, 2, 3],
                   'name': ['pizza', 'pizza toast', 'ramen']})
I used solutions like this one to create a multi-index dataframe:
Fuzzy match strings in one column and create new dataframe using fuzzywuzzy
from fuzzywuzzy import fuzz

df_copy = df.copy()
compare = pd.MultiIndex.from_product([df['name'], df_copy['name']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)
So that's great but how can I use the unique ID to prevent matching against itself?
If there's a case of ID/name = 1/pizza and 10/pizza, obviously I want to keep those. But I need to remove the same ID in both indexes.

I suggest a slightly different approach to the same result, using the Python standard library's difflib module, which provides helpers for computing deltas.
So, with the following dataframe in which pizza has two different ids (and thus should be checked against one another later on):
import pandas as pd
df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "name": ["pizza", "pizza toast", "ramen", "pizza"]}
)
Here is how you can find similarities between different id/name combinations, but avoid checking an id/name combination against itself:
from difflib import SequenceMatcher

# Define a simple helper function
def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()
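As a quick sanity check of the helper (redefined here so the snippet runs on its own): the ratio is 1.0 for identical strings and between 0 and 1 otherwise.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Similarity between two strings, in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

print(ratio("pizza", "pizza"))        # 1.0
print(ratio("pizza", "pizza toast"))  # 0.625
```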
And then, with the following steps:
# Create a column of unique identifiers: (id, name)
df["id_and_name"] = list(zip(df["id"], df["name"]))

# Calculate ratio only for different id_and_names
df = df.assign(
    match=df["id_and_name"].map(
        lambda x: {
            value: ratio(x[1], value[1])
            for value in df["id_and_name"]
            if x[0] != value[0] or ratio(x[1], value[1]) != 1
        }
    )
)
# Format results in a readable fashion
# Format results in a readable fashion
df = (
    pd.DataFrame(df["match"].to_list(), index=df["id_and_name"])
    .reset_index(drop=False)
    .melt("id_and_name", var_name="other_id_and_name", value_name="ratio")
    .dropna()
    .sort_values(by=["id_and_name", "ratio"], ascending=[True, False])
    .reset_index(drop=True)
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"] * 100))
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"].astype(int)))
)
You get the expected result:
print(df)
# Output
id_and_name other_id_and_name ratio
0 (1, pizza) (4, pizza) 100
1 (1, pizza) (2, pizza toast) 62
2 (1, pizza) (3, ramen) 20
3 (2, pizza toast) (4, pizza) 62
4 (2, pizza toast) (1, pizza) 62
5 (2, pizza toast) (3, ramen) 12
6 (3, ramen) (4, pizza) 20
7 (3, ramen) (1, pizza) 20
8 (3, ramen) (2, pizza toast) 12
9 (4, pizza) (1, pizza) 100
10 (4, pizza) (2, pizza toast) 62
11 (4, pizza) (3, ramen) 20

pandas get max threshold values from tuples in list

I am working with a pandas dataframe. One of the columns has a list of tuples in each row, each with a score. I am trying to keep the tuples with scores higher than 0.20. How do I apply a threshold instead of taking the max? I tried itemgetter and a lambda with if/else, but it didn't work as I thought. What am I doing wrong?
from operator import itemgetter
import pandas as pd
# sample data
l1 = ['1','2','3']
l2 = ['test1','test2','test3']
l3 = [[(1,0.95),(5,0.05)],[(7,0.10),(1,0.20),(6,0.70)],[(7,0.30),(1,0.70)]]
df = pd.DataFrame({'id':l1,'text':l2,'score':l3})
print(df)
# Preview from print statement above
id text score
1 test1 [(1, 0.95), (5, 0.05)]
2 test2 [(7, 0.1), (1, 0.2), (6, 0.7)]
3 test3 [(7, 0.3), (1, 0.7)]
# Try #1:
print(df['score'].apply(lambda x: max(x,key=itemgetter(0))))
# Preview from print statement above
(5, 0.05)
(7, 0.1)
(7, 0.3)
# Try #2: Gives `TypeError`
df['score'].apply(lambda x: ((x,itemgetter(0)) if x >= 0.20 else ''))
What I am trying to get for output:
id text probability output needed
1 test1 [(1, 0.95), (5, 0.05)] [(1, 0.95)]
2 test2 [(7, 0.1), (1, 0.2), (6, 0.7)] [(1, 0.2), (6, 0.7)]
3 test3 [(7, 0.3), (1, 0.7)] [(7, 0.3), (1, 0.7)]
You can use a pretty straightforward list comprehension to get the desired output. I'm not sure how you would use itemgetter for this:
df['score'] = df['score'].apply(lambda x: ([y for y in x if min(y) >= .2]))
df
id text score
0 1 test1 [(1, 0.95)]
1 2 test2 [(1, 0.2), (6, 0.7)]
2 3 test3 [(7, 0.3), (1, 0.7)]
If you want an alternative result (like an empty tuple), you can use:
df['score'] = df['score'].apply(lambda x: [y if min(y) >= .2 else () for y in x])
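Note that min(y) only works here because every score happens to be smaller than its paired id; selecting the score position explicitly (with itemgetter(1), as the question hinted, or plain indexing) is more robust. A sketch under that assumption:

```python
from operator import itemgetter

import pandas as pd

# Same sample data as the question
l1 = ['1', '2', '3']
l2 = ['test1', 'test2', 'test3']
l3 = [[(1, 0.95), (5, 0.05)], [(7, 0.10), (1, 0.20), (6, 0.70)], [(7, 0.30), (1, 0.70)]]
df = pd.DataFrame({'id': l1, 'text': l2, 'score': l3})

# Keep only the tuples whose score (second element) meets the threshold
df['score'] = df['score'].apply(lambda x: [y for y in x if itemgetter(1)(y) >= 0.2])
print(df['score'].tolist())
# [[(1, 0.95)], [(1, 0.2), (6, 0.7)], [(7, 0.3), (1, 0.7)]]
```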

List of list to get element whose values greater than 3

I have 2 lists, each of size 250000. I want to iterate through the lists and return the values that are greater than 3.
For example:
import itertools
from array import array
import numpy as np
input = (np.array([list([8,1]), list([2,3,4]), list([5,3])],dtype=object), np.array([1,0,0,0,1,1,1]))
X = input[0]
y = input[1]
res = [ u for s in X for u in zip(y,s) ]
res
I don't get the expected output.
Actual res : [(1, 8), (0, 1), (1, 2), (0, 3), (0, 4), (1, 5), (0, 3)]
Expected output 1 : [(8,1), (1,0), (2, 0), (3, 0), (4, 1), (5, 1), (3, 1)]
Expected output 2 : [(8,1), (4, 1), (5, 1))] ---> for greater than 3
I took references from stackoverflow. Tried itertools as well.
Using NumPy to store lists of non-uniform lengths creates a whole lot of issues, like the ones you are seeing. If it were an array of integers, you could simply do
X[X > 3]
but since it is an array of lists, you have to jump through all sorts of hoops to get what you want, and basically lose all the advantages of using NumPy in the first place. You could just as well use lists of lists and skip NumPy altogether.
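Staying with plain lists, the flatten-then-filter the question describes can be sketched with itertools.chain, pairing each value with its flag to match the expected outputs:

```python
from itertools import chain

# Plain-list version of the question's data
X = [[8, 1], [2, 3, 4], [5, 3]]
y = [1, 0, 0, 0, 1, 1, 1]

# Flatten X and pair each value with its corresponding flag
pairs = list(zip(chain.from_iterable(X), y))
# [(8, 1), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (3, 1)]

# Keep only the pairs whose value is greater than 3
filtered = [(value, flag) for value, flag in pairs if value > 3]
# [(8, 1), (4, 1), (5, 1)]
```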
As an alternative I would recommend using Pandas or something else more suitable than NumPy:
import pandas as pd
df = pd.DataFrame({
'group': [0, 0, 1, 1, 1, 2, 2],
'data': [8, 1, 2, 3, 4, 5, 4],
'flag': [1, 0, 0, 0, 1, 1, 1],
})
df[df['data'] > 3]
# group data flag
# 0 0 8 1
# 4 1 4 1
# 5 2 5 1
# 6 2 4 1
Use filter
For example:
input = [1, 3, 2, 5, 6, 7, 8, 22]
# result contains the even numbers of the list
result = list(filter(lambda x: x % 2 == 0, input))
This should give you result = [2, 6, 8, 22]. Note that in Python 3, filter returns an iterator, hence the list() call.
Not sure I quite understand exactly what you're trying to do... but filter is probably a good way.

Calculate the amount of km traveled by each person in each vehicle

I have a rather large amount of odometer values from a vehicle fleet of about 40 vehicles,
driven by different persons, but in my example below, I am keeping it simple.
I have everything imported into pandas, and next to my odometer values I have who was driving the vehicle when the odometer log event was triggered (usually it triggers every km, but sometimes more, sometimes less).
Now I need to figure out how many km vehicle X has traveled while person Y was behind the wheel, but I am not sure how.
matrix = [(1, '501', "Me"),
(1, '502', "Me"),
(1, '502', "Wife"),
(1, '503', "Wife"),
(1, '504', "Wife"),
(1, '505', "Wife"),
(1, '506', "Wife"),
(1, '507', "Wife"),
(1, '508', "Wife"),
(1, '509', "Me"),
(1, '510', "Me"),
(1, '511', "Me"),
(1, '512', "Me"),
(1, '520', "Wife"),
(1, '522', "Me"),
(1, '523', "Me"),
(1, '524', "Me"),
(1, '524', "Me"),
(1, '524', "Me"),
(1, '524', "Me"),
(1, '525', "Me"),
(2, '126', "Me"),
(2, '127', "Me"),
(2, '128', "Me"),
(2, '129', "Me"),
]
# Create a DataFrame object
dfObj = pd.DataFrame(matrix, columns=['Vehicle', 'ODOmeter', 'Who'])
print (dfObj)
print ("\nVehicle 1 have driven 10 km with Me behind the wheel\nand 14 km with Wife behind the wheel\nVehicle 2 have driven 3 km with me behind the wheel")
Copy and Paste answer
Here a function to calculate and print the desired output:
def print_kilometers(dfObj):
    dfObj['ODOmeter'] = dfObj['ODOmeter'].astype(float)
    dfObj["diff"] = dfObj.groupby("Vehicle")["ODOmeter"].diff()
    sum_km = dfObj.groupby(["Who", "Vehicle"])["diff"].sum()
    for i, v in sum_km.items():
        print("Vehicle {} have driven {} km with {} behind the wheel".format(i[1], v, i[0]))
Explanation
If I understand your problem correctly, you can simply calculate the difference between consecutive odometer values of a vehicle and use the groupby function provided by pandas. Using your dataframe as an example, you can do something like:
dfObj['ODOmeter'] = dfObj['ODOmeter'].astype(float)
dfObj["diff"]= dfObj.groupby("Vehicle")["ODOmeter"].diff()
sum_km = dfObj.groupby(["Who", "Vehicle"])["diff"].sum()
In the first line I convert the ODOmeter column to float (if kilometers can only be integer values you can switch to int), in the second I add a column to the dataframe with the diff, and in the last line I group by Who and Vehicle and sum over the diff column.
You can loop and print the results:
for i, v in sum_km.items():
    print("Vehicle {} have driven {} km with {} behind the wheel".format(i[1], v, i[0]))
First you need to cast the Odometer to the proper type:
dfObj['ODOmeter'] = dfObj['ODOmeter'].astype(int)
Then you calculate the difference between two Odometer datapoints for each vehicle, giving you the tripmeter
dfObj['Tripmeter'] = dfObj.groupby('Vehicle')['ODOmeter'].diff()
And then simply add up the Tripmeters for each Vehicle and Person
dfObj.groupby(['Vehicle', 'Who'])["Tripmeter"].sum()
# Vehicle Who
# 1 Me 10.0
# Wife 14.0
# 2 Me 3.0
# Name: Tripmeter, dtype: float64

how to line plot values column vs groups

I have a dataframe x2 with two columns. I am trying to plot it, but the x-ticks don't show up.
data:
bins pp
0 (0, 1] 0.155463
1 (1, 2] 1.528947
2 (2, 3] 2.436064
3 (3, 4] 3.507811
4 (4, 5] 4.377849
5 (5, 6] 5.538044
6 (6, 7] 6.577340
7 (7, 8] 7.510983
8 (8, 9] 8.520378
9 (9, 10] 9.721899
I tried the code below; the line itself plots fine, but the x-axis ticks are just blank. I want the bins column values on the x-axis:
x2.plot(x='bins',y=['pp'])
x2.dtypes
Out[141]:
bins category
pp float64
The following is to show that this problem should not occur with pandas 0.24.1 or higher.
import numpy as np
import pandas as pd
print(pd.__version__) # prints 0.24.2
import matplotlib.pyplot as plt
df = pd.DataFrame({"Age" : np.random.rayleigh(30, size=300)})
s = pd.cut(df["Age"], bins=np.arange(0,91,10)).value_counts().to_frame().sort_index().reset_index()
s.plot(x='index',y="Age")
plt.show()
results in a line plot with the interval bins shown as x-axis tick labels.
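If upgrading pandas is not an option, a common workaround (a sketch added here, not part of the original answer) is to convert the categorical interval bins to plain strings before plotting, since string labels are always rendered as ticks. The column renaming is an assumption to keep the snippet version-independent:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the snippet runs headless
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Age": np.random.rayleigh(30, size=300)})
s = (pd.cut(df["Age"], bins=np.arange(0, 91, 10))
       .value_counts().to_frame().sort_index().reset_index())
s.columns = ["bins", "count"]  # normalize names across pandas versions

# Interval categoricals confused older pandas plotting;
# plain strings always show up as tick labels
s["bins"] = s["bins"].astype(str)
ax = s.plot(x="bins", y="count")
plt.show()
```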

Get degree of each nodes in a graph by Networkx in python

Suppose I have a data set like below that shows an undirected graph:
1 2
1 3
1 4
3 5
3 6
7 8
8 9
10 11
I have a python script like it:
for s in ActorGraph.degree():
    print(s)
which yields key/value pairs where the keys are node names and the values are node degrees:
('9', 1)
('5', 1)
('11', 1)
('8', 2)
('6', 1)
('4', 1)
('10', 1)
('7', 1)
('2', 1)
('3', 3)
('1', 3)
The networkx documentation suggests using values() to get the node degrees.
Now I'd like to get just the degrees of the nodes, so I use this part of the script, but it doesn't work and says the object has no attribute 'values':
for s in ActorGraph.degree():
    print(s.values())
how can I do it?
You are using version 2.0 of networkx, which changed G.degree() from returning a dict to returning a dict-like (but not dict) DegreeView. See this guide.
To have the degrees in a list you can use a list-comprehension:
degrees = [val for (node, val) in G.degree()]
I'd like to add the following: if you're initializing the undirected graph with nx.Graph() and adding the edges afterwards, just beware that networkx doesn't guarantee the order of nodes will be preserved -- this also applies to degree(). This means that if you use the list-comprehension approach and then try to access degrees by list index, the indexes may not correspond to the right nodes. If you'd like them to correspond, you can instead do:
degrees = [val for (node, val) in sorted(G.degree(), key=lambda pair: pair[0])]
Here's a simple example to illustrate this:
>>> edges = [(0, 1), (0, 3), (0, 5), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5)]
>>> g = nx.Graph()
>>> g.add_edges_from(edges)
>>> print(g.degree())
[(0, 3), (1, 4), (3, 3), (5, 2), (2, 4), (4, 2)]
>>> print([val for (node, val) in g.degree()])
[3, 4, 3, 2, 4, 2]
>>> print([val for (node, val) in sorted(g.degree(), key=lambda pair: pair[0])])
[3, 4, 4, 3, 2, 2]
You can also use a dict comprehension to get an actual dictionary:
degrees = {node:val for (node, val) in G.degree()}
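Since DegreeView is dict-like, it can also be passed straight to dict() for the same result, here on the question's first few edges:

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([(1, 2), (1, 3), (1, 4), (3, 5), (3, 6)])

degrees = dict(g.degree())  # node -> degree mapping
print(degrees[1], degrees[3])  # 3 3
```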
