Calculate the amount of km traveled by each person in each vehicle - python-3.x

I have a rather large number of odometer values from a vehicle fleet of about 40 vehicles, driven by different people, but in my example below I am keeping it simple.
I have everything imported into pandas, and next to each odometer value I have who was driving the vehicle when the odometer log event was triggered (usually an event triggers every km, but sometimes more often, sometimes less often).
Now I need to figure out how many km vehicle X has traveled while person Y was behind the wheel, but I am not sure how.
matrix = [
    (1, '501', "Me"),
    (1, '502', "Me"),
    (1, '502', "Wife"),
    (1, '503', "Wife"),
    (1, '504', "Wife"),
    (1, '505', "Wife"),
    (1, '506', "Wife"),
    (1, '507', "Wife"),
    (1, '508', "Wife"),
    (1, '509', "Me"),
    (1, '510', "Me"),
    (1, '511', "Me"),
    (1, '512', "Me"),
    (1, '520', "Wife"),
    (1, '522', "Me"),
    (1, '523', "Me"),
    (1, '524', "Me"),
    (1, '524', "Me"),
    (1, '524', "Me"),
    (1, '524', "Me"),
    (1, '525', "Me"),
    (2, '126', "Me"),
    (2, '127', "Me"),
    (2, '128', "Me"),
    (2, '129', "Me"),
]
import pandas as pd

# Create a DataFrame object
dfObj = pd.DataFrame(matrix, columns=['Vehicle', 'ODOmeter', 'Who'])
print (dfObj)
print ("\nVehicle 1 have driven 10 km with Me behind the wheel\nand 14 km with Wife behind the wheel\nVehicle 2 have driven 3 km with me behind the wheel")

Copy and Paste answer
Here is a function to calculate and print the desired output:
def print_kilometers(dfObj):
    dfObj['ODOmeter'] = dfObj['ODOmeter'].astype(float)
    dfObj["diff"] = dfObj.groupby("Vehicle")["ODOmeter"].diff()
    sum_km = dfObj.groupby(["Who", "Vehicle"])["diff"].sum()
    for i, v in sum_km.items():
        print("Vehicle {} have driven {} km with {} behind the wheel".format(i[1], v, i[0]))
Explanation
If I understand your problem correctly, you can simply calculate the difference between the km readings of a vehicle and use the groupby function provided by pandas. Using your dataframe as an example, you can do something like:
dfObj['ODOmeter'] = dfObj['ODOmeter'].astype(float)
dfObj["diff"]= dfObj.groupby("Vehicle")["ODOmeter"].diff()
sum_km = dfObj.groupby(["Who", "Vehicle"])["diff"].sum()
In the first line I convert the ODOmeter column to float (if kilometers can only be integer values you can switch to int), in the second I add a column to the dataframe with the diff, and in the last line I group by Who and Vehicle and sum over the diff column.
You can loop and print the results:
for i, v in sum_km.items():
    print("Vehicle {} have driven {} km with {} behind the wheel".format(i[1], v, i[0]))

First you need to cast the Odometer to the proper type:
dfObj['ODOmeter'] = dfObj['ODOmeter'].astype(int)
Then you calculate the difference between consecutive odometer data points for each vehicle, giving you the tripmeter:
dfObj['Tripmeter'] = dfObj.groupby('Vehicle')['ODOmeter'].diff()
And then simply add up the tripmeters for each vehicle and person:
dfObj.groupby(['Vehicle', 'Who'])["Tripmeter"].sum()
# Vehicle  Who
# 1        Me      10.0
#          Wife    14.0
# 2        Me       3.0
# Name: Tripmeter, dtype: float64
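If a wide, per-vehicle table is easier to read, the same totals can be pivoted with unstack (a small optional sketch; the output is shown approximately):
dfObj.groupby(['Vehicle', 'Who'])["Tripmeter"].sum().unstack(fill_value=0)
# Who        Me  Wife
# Vehicle
# 1        10.0  14.0
# 2         3.0   0.0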

Related

pandas fuzzy match on the same column but prevent matching against itself

This is a common question, but I have an extra condition: how do I remove matches based on a unique ID? Or rather, how do I prevent a row from matching against itself?
Given a dataframe:
df = pd.DataFrame({'id': [1, 2, 3],
                   'name': ['pizza', 'pizza toast', 'ramen']})
I used solutions like this one to create a multi-index dataframe:
Fuzzy match strings in one column and create new dataframe using fuzzywuzzy
df_copy = df.copy()
compare = pd.MultiIndex.from_product([df['name'], df_copy['name']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)
So that's great, but how can I use the unique ID to prevent matching against itself?
If there's a case of ID/name = 1/pizza and 10/pizza, obviously I want to keep those. But I need to remove pairs where the same ID appears in both index levels.
I suggest a slightly different approach for the same result, using the Python standard library difflib module, which provides helpers for computing deltas.
So, with the following dataframe in which pizza has two different ids (and thus should be checked against one another later on):
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "name": ["pizza", "pizza toast", "ramen", "pizza"]}
)
Here is how you can find similarities between different id/name combinations, but avoid checking an id/name combination against itself:
from difflib import SequenceMatcher

# Define a simple helper function
def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()
And then, with the following steps:
# Create a column of unique identifiers: (id, name)
df["id_and_name"] = list(zip(df["id"], df["name"]))

# Calculate ratio only for different id_and_names
df = df.assign(
    match=df["id_and_name"].map(
        lambda x: {
            value: ratio(x[1], value[1])
            for value in df["id_and_name"]
            if x[0] != value[0] or ratio(x[1], value[1]) != 1
        }
    )
)
# Format results in a readable fashion
df = (
    pd.DataFrame(df["match"].to_list(), index=df["id_and_name"])
    .reset_index(drop=False)
    .melt("id_and_name", var_name="other_id_and_name", value_name="ratio")
    .dropna()
    .sort_values(by=["id_and_name", "ratio"], ascending=[True, False])
    .reset_index(drop=True)
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"] * 100))
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"].astype(int)))
)
You get the expected result:
print(df)
# Output
id_and_name other_id_and_name ratio
0 (1, pizza) (4, pizza) 100
1 (1, pizza) (2, pizza toast) 62
2 (1, pizza) (3, ramen) 20
3 (2, pizza toast) (4, pizza) 62
4 (2, pizza toast) (1, pizza) 62
5 (2, pizza toast) (3, ramen) 12
6 (3, ramen) (4, pizza) 20
7 (3, ramen) (1, pizza) 20
8 (3, ramen) (2, pizza toast) 12
9 (4, pizza) (1, pizza) 100
10 (4, pizza) (2, pizza toast) 62
11 (4, pizza) (3, ramen) 20
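As a quick optional check (not part of the original answer) that no id/name combination was compared against itself:
# Every remaining pair compares two different ids
same_id = df["id_and_name"].map(lambda t: t[0]) == df["other_id_and_name"].map(lambda t: t[0])
assert not same_id.any()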

Interpolate seconds to milliseconds in dataset?

I have a dataset sorted by timestamps in seconds. However, I need to somehow convert it to millisecond accuracy.
Example
dataset = [
    # UNIX timestamps with reading data
    (0, 0.48499),
    (2, 0.48475),
    (3, 0.48475),
    (3, 0.48473),
    (3, 0.48433),
    (3, 0.48403),
    (3, 0.48403),
    (3, 0.48403),
    (3, 0.48403),
    (3, 0.48403),
    (5, 0.48396),
    (12, 0.48353),
]
Expected output (roughly)
interpolated = [
    # Timestamps with millisecond accuracy
    (0.0, 0.48499),
    (2.0, 0.48475),
    (3.0, 0.48475),
    (3.14, 0.48473),
    (3.28, 0.48433),
    (3.42, 0.48403),
    (3.57, 0.48403),
    (3.71, 0.48403),
    (3.85, 0.48403),
    (3.99, 0.48403),
    (5.0, 0.48396),
    (12.0, 0.48353),
]
I don't have much experience with Pandas, and I've gone through interpolate and drop_duplicates but couldn't figure out how to go about this.
I would think this is a common problem, so any help is appreciated. Ideally I want to spread the numbers evenly.
You can use the groupby and apply methods. I didn't come up with a specific built-in method like interpolate for this case, but there might be a more pythonic way.
Code:
import numpy as np
import pandas as pd
# Create a sample dataframe
dataset = [(0, 0.48499), (2, 0.48475), (3, 0.48475), (3, 0.48473), (3, 0.48433), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (5, 0.48396), (12, 0.48353)]
df = pd.DataFrame(dataset, columns=['t', 'value'])
# Convert UNIX timestamps into the desired format
df.t = df.groupby('t', group_keys=False).apply(lambda df: df.t + np.linspace(0, 1, len(df)))
Output:
t          value
0          0.48499
2          0.48475
3          0.48475
3.14286    0.48473
3.28571    0.48433
3.42857    0.48403
3.57143    0.48403
3.71429    0.48403
3.85714    0.48403
4          0.48403
5          0.48396
12         0.48353
Input (for comparison):
t          value
0          0.48499
2          0.48475
3          0.48475
3          0.48473
3          0.48433
3          0.48403
3          0.48403
3          0.48403
3          0.48403
3          0.48403
5          0.48396
12         0.48353
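Note that np.linspace(0, 1, len(df)) makes the last duplicate of t=3 land exactly on 4.0. If you would rather keep the interpolated values strictly below the next whole second, as in the question's expected output, one possible variation (a sketch, not part of the original answer) is to pass endpoint=False:
import numpy as np
import pandas as pd

df = pd.DataFrame(dataset, columns=['t', 'value'])

# Spread duplicate timestamps across [t, t + 1) so they never reach the next second
df.t = df.groupby('t', group_keys=False).apply(
    lambda g: g.t + np.linspace(0, 1, len(g), endpoint=False)
)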

python find the top N weighted edges regardless of weight

I am looking for a way to find the 5 biggest weighted edges of a node. Is there a way to specify that I want exactly the biggest 5 edges without a specific threshold value (i.e., something universal that works for any weighted graph)?
You could consider the edges sorted by weight and build a dictionary that maps each node to its edges, sorted by weight in non-increasing order.
>>> from collections import defaultdict
>>> res = defaultdict(list)
>>> for u,v in sorted(G.edges(), key=lambda x: G.get_edge_data(x[0], x[1])["weight"], reverse=True):
... res[u].append((u,v))
... res[v].append((u,v))
...
Then, given a node (e.g., 0), you could get the top N (e.g., 5) weighted edges as
>>> res[0][:5]
[(0, 7), (0, 2), (0, 6), (0, 1), (0, 3)]
If you only need to do it for a node (e.g., 0), you can directly do:
>>> sorted_edges_u = sorted(G.edges(0), key=lambda x: G.get_edge_data(x[0], x[1])["weight"], reverse=True)
>>> sorted_edges_u[:5]
[(0, 7), (0, 2), (0, 6), (0, 1), (0, 3)]
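For a self-contained illustration (the graph and its weights below are made up for the example, and a networkx graph is assumed), the same idea can be written directly for one node:
import networkx as nx

# Hypothetical example graph with arbitrary weights
G = nx.Graph()
G.add_weighted_edges_from([
    (0, 1, 4.0), (0, 2, 9.5), (0, 3, 3.2), (0, 6, 5.1),
    (0, 7, 12.0), (0, 8, 1.0), (1, 2, 2.2),
])

# Top 5 heaviest edges incident to node 0, without any threshold
top5 = sorted(G.edges(0, data="weight"), key=lambda e: e[2], reverse=True)[:5]
print(top5)
# [(0, 7, 12.0), (0, 2, 9.5), (0, 6, 5.1), (0, 1, 4.0), (0, 3, 3.2)]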

pandas get max threshold values from tuples in list

I am working with a pandas dataframe. One of the columns has a list of tuples in each row, each with some score. I am trying to get the tuples with scores higher than 0.20. How do I apply a threshold instead of max? I tried itemgetter and a lambda with if/else; it didn't work as I thought. What am I doing wrong?
from operator import itemgetter
import pandas as pd
# sample data
l1 = ['1','2','3']
l2 = ['test1','test2','test3']
l3 = [[(1,0.95),(5,0.05)],[(7,0.10),(1,0.20),(6,0.70)],[(7,0.30),(1,0.70)]]
df = pd.DataFrame({'id':l1,'text':l2,'score':l3})
print(df)
# Preview from print statement above
id text score
1 test1 [(1, 0.95), (5, 0.05)]
2 test2 [(7, 0.1), (1, 0.2), (6, 0.7)]
3 test3 [(7, 0.3), (1, 0.7)]
# Try #1:
print(df['score'].apply(lambda x: max(x,key=itemgetter(0))))
# Preview from print statement above
(5, 0.05)
(7, 0.1)
(7, 0.3)
# Try #2: Gives `TypeError`
df['score'].apply(lambda x: ((x,itemgetter(0)) if x >= 0.20 else ''))
What I am trying to get for output:
id text probability output needed
1 test1 [(1, 0.95), (5, 0.05)] [(1, 0.95)]
2 test2 [(7, 0.1), (1, 0.2), (6, 0.7)] [(1, 0.2), (6, 0.7)]
3 test3 [(7, 0.3), (1, 0.7)] [(7, 0.3), (1, 0.7)]
You can use a pretty straightforward list comprehension to get the desired output. I'm not sure how you would use itemgetter for this:
df['score'] = df['score'].apply(lambda x: ([y for y in x if min(y) >= .2]))
df
id text score
0 1 test1 [(1, 0.95)]
1 2 test2 [(1, 0.2), (6, 0.7)]
2 3 test3 [(7, 0.3), (1, 0.7)]
If you want an alternative placeholder result (like an empty tuple), you can use:
df['score'] = df['score'].apply(lambda x: ([y if min(y) >= .2 else () for y in x ]))
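Since the question mentions itemgetter, here is a sketch of the same filter that reads the score explicitly with itemgetter(1) instead of relying on min (this assumes the score is always the second element of each tuple, as in the sample data):
from operator import itemgetter

get_score = itemgetter(1)
# Applied to the original 'score' column from the question
df['score'] = df['score'].apply(lambda x: [y for y in x if get_score(y) >= 0.2])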

Selecting sublists of a list of lists to define a relation

If I happen to have the following list of lists:
L=[[(1,3)],[(1,3),(2,4)],[(1,3),(1,4)],[(1,2)],[(1,2),(1,3)],[(1,3),(2,4),(1,2)]]
and what I wish to do is to create a relation between lists in the following way:
I wish to say that
[(1,3)] and [(1,3),(1,4)]
are related, because the first is a sublist of the second, but then I would like to add this relation into a list as:
Relations=[([(1,3)],[(1,3),(1,4)])]
but, we can also see that:
[(1,3)] and [(1,3),(2,4)]
are related, because the first is a sublist of the second, so I would want this to also be a relation added into my Relations list:
Relations=[([(1,3)],[(1,3),(1,4)]),([(1,3)],[(1,3),(2,4)])]
The only thing I wish to be careful with is that I consider a list to be a sublist of another only if they differ by ONE element. So, in other words, we cannot have:
([(1,3)],[(1,3),(2,4),(1,2)])
as an element of my Relations list, but we SHOULD have:
([(1,3),(2,4)],[(1,3),(2,4),(1,2)])
as an element in my Relations list.
I hope there is an optimal way to do this, since in the original context I have to deal with a much bigger list of lists.
Any help given is much appreciated.
You really haven't provided enough information, so I can't tell whether you need itertools.combinations() or itertools.permutations(). Your examples work with itertools.combinations, so I will use that.
If x and y are two elements of the list, then you just want all occurrences where set(x).issubset(y) and the size of the set difference is at most one, i.e. len(set(y) - set(x)) <= 1, e.g.:
In []:
import itertools as it
[[x, y] for x, y in it.combinations(L, r=2) if set(x).issubset(y) and len(set(y)-set(x)) <= 1]
Out[]:
[[[(1, 3)], [(1, 3), (2, 4)]],
[[(1, 3)], [(1, 3), (1, 4)]],
[[(1, 3)], [(1, 2), (1, 3)]],
[[(1, 3), (2, 4)], [(1, 3), (2, 4), (1, 2)]],
[[(1, 2)], [(1, 2), (1, 3)]],
[[(1, 2), (1, 3)], [(1, 3), (2, 4), (1, 2)]]]
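Wrapped into a self-contained sketch that builds the Relations list of (sublist, superlist) tuples described in the question; ordering each pair by length first is a small addition so the check works regardless of the order the lists appear in L, and like the answer above it treats the inner lists as sets:
import itertools as it

L = [[(1, 3)], [(1, 3), (2, 4)], [(1, 3), (1, 4)], [(1, 2)],
     [(1, 2), (1, 3)], [(1, 3), (2, 4), (1, 2)]]

Relations = []
for x, y in it.combinations(L, r=2):
    # Put the shorter list first so it is the candidate sublist
    small, big = (x, y) if len(x) <= len(y) else (y, x)
    if set(small).issubset(big) and len(set(big) - set(small)) <= 1:
        Relations.append((small, big))

print(Relations)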
