Find duplicated rows based on a condition over selected columns in a pandas DataFrame - python-3.x

I have an extensive database converted into a dataframe, in which it is difficult to identify the following manually.
The dataframe has columns named from_bus and to_bus, which are unique identifiers regardless of order; for example, for element 0, L_ABAN_MACA_0_1, the associated pair (109, 140) is the same as (140, 109).
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1       109     140      0.444450
1  L_AGOY_BAÑO_1_1        69      66      0.476683
2  L_AGOY_BAÑO_1_2        69      66      0.476683
3  L_ALAN_INGA_1_1       189     188      0.452790
4  L_ALAN_INGA_1_2       188     189      0.500450
So I want to identify the duplicated pairs and replace them with a single row, whose x_ohm_per_km column value is defined as the sum of the duplicated values, as follows:
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1       109     140      0.444450
1  L_AGOY_BAÑO_1_1        69      66      0.953366
3  L_ALAN_INGA_1_1       189     188      0.953240

Let us try groupby on from_bus and to_bus after sorting the values in these columns along axis=1, then agg to aggregate the result, and optionally reindex to restore the original column order:
import numpy as np

c = ['from_bus', 'to_bus']
df[c] = np.sort(df[c], axis=1)
df.groupby(c, sort=False, as_index=False)\
  .agg({'name': 'first', 'x_ohm_per_km': 'sum'})\
  .reindex(df.columns, axis=1)
Alternative approach:
d = {**dict.fromkeys(df, 'first'), 'x_ohm_per_km': 'sum'}
df.groupby([*np.sort(df[c], axis=1).T], sort=False, as_index=False).agg(d)
name from_bus to_bus x_ohm_per_km
0 L_ABAN_MACA_0_1 109 140 0.444450
1 L_AGOY_BAÑO_1_1 66 69 0.953366
2 L_ALAN_INGA_1_1 188 189 0.953240
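A self-contained sketch of the first approach, rebuilding the sample frame from the question (the literal values are copied from the tables above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['L_ABAN_MACA_0_1', 'L_AGOY_BAÑO_1_1', 'L_AGOY_BAÑO_1_2',
             'L_ALAN_INGA_1_1', 'L_ALAN_INGA_1_2'],
    'from_bus': [109, 69, 69, 189, 188],
    'to_bus': [140, 66, 66, 188, 189],
    'x_ohm_per_km': [0.444450, 0.476683, 0.476683, 0.452790, 0.500450],
})

c = ['from_bus', 'to_bus']
df[c] = np.sort(df[c], axis=1)  # make each (from_bus, to_bus) pair order-independent
out = (df.groupby(c, sort=False, as_index=False)
         .agg({'name': 'first', 'x_ohm_per_km': 'sum'})
         .reindex(df.columns, axis=1))
print(out)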

Related

test/train splits in pycaret using a column for grouping rows that should be in the same split

My dataset contains a column with data I need to use to split by groups, in such a way that rows belonging to the same group are not divided between train and test but are sent as a whole to one of the splits, using PyCaret.
10 row sample for clarification:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
2 23 34 233
2 623 22 888
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
every unique group_id should be sent to one of the splits in full, this way (using 80/20):
TRAIN SET:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
TEST SET:
group_id measure1 measure2 measure3
2 23 34 233
2 623 22 888
You can try the following, per the documentation:
https://pycaret.readthedocs.io/en/latest/api/classification.html
fold_strategy = "groupkfold"
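A minimal sketch of how that could be wired into setup, assuming a hypothetical 'label' target column on top of the sample data (per that docs page, fold_groups accepts a group column name when fold_strategy is 'groupkfold'):

from pycaret.classification import setup, compare_models

# df: the sample dataframe above plus a hypothetical 'label' target column
s = setup(data=df, target='label',
          fold_strategy='groupkfold',
          fold_groups='group_id')  # rows sharing a group_id stay in one fold
best = compare_models()

Note this controls the cross-validation folds; it does not by itself change the initial train/test split.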
One solution could look like this:
import numpy as np
import pandas as pd
from itertools import combinations


def is_possible_sum(numbers, n):
    for r in range(len(numbers)):
        for combo in combinations(numbers, r + 1):
            if sum(combo) == n:
                return combo
    print('Desired split not possible')
    raise ArithmeticError


def train_test_split(table: pd.DataFrame, train_fraction: float, col_identifier: str):
    train_ids = []
    occurrences = table[col_identifier].value_counts().to_dict()
    required = sum(occurrences.values()) * train_fraction
    lengths = is_possible_sum(occurrences.values(), required)
    for i in lengths:
        for key, value in occurrences.items():
            if value == i:
                train_ids.append(key)
                del occurrences[key]  # prevents the same ID from being selected twice
                break
    train = table[table[col_identifier].isin(train_ids)]
    test = table[~table[col_identifier].isin(train_ids)]
    return train, test


if __name__ == '__main__':
    df = pd.DataFrame()
    df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    df['Measurement'] = np.random.random(10)
    train_part, test_part = train_test_split(df, 0.8, 'Group_ID')
Some remarks:
This is probably the least elegant way to do it...
It uses an ungodly amount of for loops and is probably slow for larger dataframes. It also doesn't randomize the split.
Lots of this is because the dictionary of group_id and the count of the samples with a certain group_id can't be reversed as some entries might be ambiguous. You could probably do this with numpy arrays as well, but I doubt that the overall structure would be much different.
First function taken from here: How to check if a sum is possible in array?
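If a randomized split is acceptable, a different route is scikit-learn's GroupShuffleSplit, which does the group-respecting shuffling out of the box; note this is a swapped-in technique, and its train_size is the fraction of groups rather than rows:

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({'Group_ID': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
                   'Measurement': np.random.random(10)})

# one randomized split; every group lands entirely in train or in test
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df['Group_ID']))
train_part, test_part = df.iloc[train_idx], df.iloc[test_idx]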

Pandas Aggregate columns dynamically

My goal is to aggregate data similarly to SAS's proc summary with types. My starting pandas dataframe could look like the one below, where the database has already done the original group by over all dimensions/classification variables and applied some aggregate function to the measures.
So in sql this would look like
select gender, age, sum(height), sum(weight)
from db.table
group by gender, age
gender  age  height  weight
F       19   70      123
M       24   72      172
I then would like to summarize the data using pandas to calculate summary rows based on different group bys to come out with this.
gender  age  height  weight
.       .    142     295
.       19   70      123
.       24   72      172
F       .    70      123
M       .    72      172
F       19   70      123
M       24   72      172
Here the first row is the aggregate with no group by, rows 2 and 3 are aggregates grouped by age, rows 4 and 5 are grouped by just gender, and then come the normal rows.
My current code looks like this
# normally dynamic, just hard-coded for this example
measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'sum'}}
msr_config_dict = {}
for measure in measures:
    if measure in message_measures:
        stat = measures[measure]['stat']
        msr_config_dict[measure] = pd.NamedAgg(measure, stat)

# compute agg with no group by as starting point
df = self.df.agg(**msr_config_dict)

dimensions = ['gender', 'age']  # dimensions is also dynamic in real life
dim_vars = []
for dim in dimensions:
    dim_vars.append(dim)
    if len(dim_vars) > 1:
        # compute agg of compound dimensions
        df_temp = self.df.groupby(dim_vars, as_index=False).agg(msr_config_dict)
        df = df.append(df_temp, ignore_index=True)
    # always compute agg of solo dimension
    df_temp = self.df.groupby(dim, as_index=False).agg(msr_config_dict)
    df = df.append(df_temp, ignore_index=True)
With this code I get AttributeError: 'height' is not a valid function for 'Series' object.
As input to the agg function I have also tried {'height': [('height', 'sum')], 'weight': [('weight', 'sum')]}, where I am trying to compute the sum of all heights and name the output height; that also raised an attribute error.
I know I will only ever compute one aggregate function per measure, so I would like to dynamically build the input to the pandas agg function and always rename the stat to itself, so I can just append the result to the dataframe I am building with the summary rows.
I am new to pandas coming from SAS background.
Any help would be much appreciated.
IIUC:
cols = ['height', 'weight']
out = pd.concat([df[cols].sum(0).to_frame().T,
                 df.groupby('age')[cols].sum().reset_index(),
                 df.groupby('gender')[cols].sum().reset_index(),
                 df], ignore_index=True)[df.columns].fillna('.')
Output:
>>> out
gender age height weight
0 . . 142 295
1 . 19.0 70 123
2 . 24.0 72 172
3 F . 70 123
4 M . 72 172
5 F 19.0 70 123
6 M 24.0 72 172
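As for the AttributeError in the question: pd.NamedAgg is only understood by groupby(...).agg(...), not by plain DataFrame.agg, which is why the "no group by" starting point fails. A minimal sketch of that distinction (column names from the question; the plain-dict fallback for the ungrouped case is my own workaround, not the asker's code):

import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M'], 'age': [19, 24],
                   'height': [70, 72], 'weight': [123, 172]})
measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'sum'}}

# NamedAgg works when passed as keyword arguments to groupby.agg ...
named = {m: pd.NamedAgg(column=m, aggfunc=cfg['stat']) for m, cfg in measures.items()}
by_gender = df.groupby('gender', as_index=False).agg(**named)

# ... but DataFrame.agg needs a plain {column: function} mapping
plain = {m: cfg['stat'] for m, cfg in measures.items()}
totals = df.agg(plain).to_frame().T  # one row: the ungrouped sums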
Here is a more flexible solution, extending the solution of @Corralien. You can use itertools.combinations to create all the combinations of dimensions, for every possible combination length.
from itertools import combinations

# your input
measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'min'}}
dimensions = ['gender', 'age']

# flatten the nested dictionary
msr_config_dict = {key: val['stat'] for key, val in measures.items()}

# concat all possible aggregations
res = pd.concat(
    # case with everything aggregated
    [df.agg(msr_config_dict).to_frame().T]
    # cases with at least one column to aggregate over
    + [df.groupby(list(_dimCols)).agg(msr_config_dict).reset_index()
       # for combinations of length 1, 2, ... depending on the number of dimensions
       for nb_cols in range(1, len(dimensions))
       # all combinations of the specific length
       for _dimCols in combinations(dimensions, nb_cols)]
    # original dataframe
    + [df],
    ignore_index=True)[df.columns].fillna('.')
print(res)
# gender age height weight
# 0 . . 142 123
# 1 F . 70 123
# 2 M . 72 172
# 3 . 19.0 70 123
# 4 . 24.0 72 172
# 5 F 19.0 70 123
# 6 M 24.0 72 172

How to check if a value in a column is found in a list in a column, with Spark SQL?

I have a delta table A as shown below.
point  cluster  points_in_cluster
37     1        [37, 32]
45     2        [45, 67, 84]
67     2        [45, 67, 84]
84     2        [45, 67, 84]
32     1        [37, 32]
Also I have a table B as shown below.
id   point
101  37
102  67
103  84
I want a query like the following. Here in obviously doesn't work for an array column, so what would be the right syntax?
select b.id, a.point
from A a, B b
where b.point in a.points_in_cluster
As a result I should have a table like the following:
id   point
101  37
101  32
102  45
102  67
102  84
103  45
103  67
103  84
Based on your data sample, I'd do an equi-join on the point column and then an explode on points_in_cluster:
from pyspark.sql import functions as F

# assuming A is df_A and B is df_B
df_A.join(
    df_B,
    on="point"
).select(
    "id",
    F.explode("points_in_cluster").alias("point")
)
Otherwise, use array_contains:
select b.id, a.point
from A a, B b
where array_contains(a.points_in_cluster, b.point)
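For a self-contained check of the SQL version, a sketch along these lines should reproduce the expected output (the SparkSession bootstrap and inline data are just scaffolding for the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_A = spark.createDataFrame(
    [(37, 1, [37, 32]), (45, 2, [45, 67, 84]), (67, 2, [45, 67, 84]),
     (84, 2, [45, 67, 84]), (32, 1, [37, 32])],
    ["point", "cluster", "points_in_cluster"])
df_B = spark.createDataFrame([(101, 37), (102, 67), (103, 84)], ["id", "point"])

df_A.createOrReplaceTempView("A")
df_B.createOrReplaceTempView("B")
spark.sql("""
    select b.id, a.point
    from A a, B b
    where array_contains(a.points_in_cluster, b.point)
""").show()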

Appending DataFrame to empty DataFrame in {Key: Empty DataFrame (with columns)}

I am struggling to understand this one.
I have a regular df (same columns as the empty df in the dict) and an empty df which is a value in a dictionary (the keys in the dict vary based on certain inputs, so there can be just one key/value pair or multiple ones; I think this might be relevant). The dict structure is essentially:
{key: [[Empty DataFrame
Columns: [list of columns]
Index: []]]}
I am using the following code to try and add the data:
dict[key].append(df, ignore_index=True)
The error I get is:
temp_dict[product_match].append(regular_df, ignore_index=True)
TypeError: append() takes no keyword arguments
Is this error because I mis-specified the value I am attempting to append the df to (i.e., am I trying to append the df to the key instead here), or is it something else?
Your dictionary contains a list of lists at the key; we can see this in the shown output:
{key: [[Empty DataFrame Columns: [list of columns] Index: []]]}
# ^^ list starts ^^ list ends
For this reason dict[key].append is calling list.append, as mentioned by @nandoquintana.
To append to the DataFrame access the specific element in the list:
temp_dict[product_match][0][0].append(df, ignore_index=True)
Notice there is no inplace version of append. append always produces a new DataFrame:
Sample Program:
import numpy as np
import pandas as pd

temp_dict = {
    'key': [[pd.DataFrame()]]
}
product_match = 'key'

np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))

temp_dict[product_match][0][0].append(df, ignore_index=True)
print(temp_dict)
Output (temp_dict was not updated):
{'key': [[Empty DataFrame
Columns: []
Index: []]]}
The new DataFrame will need to be assigned to the correct location.
Either a new variable:
some_new_variable = temp_dict[product_match][0][0].append(df, ignore_index=True)
some_new_variable
0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65
Or back to the list:
temp_dict[product_match][0][0] = (
temp_dict[product_match][0][0].append(df, ignore_index=True)
)
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
Assuming the DataFrame is actually an empty DataFrame, append is unnecessary, as simply updating the value at the key to be that DataFrame works:
temp_dict[product_match] = df
temp_dict
{'key': 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65}
Or if list of list is needed:
temp_dict[product_match] = [[df]]
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
Maybe you have an empty list at dict[key]?
Remember that the list append method (unlike the pandas DataFrame one) only receives one parameter:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
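One more caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the assign-back fix above would use pd.concat instead, roughly:

import pandas as pd

# pandas >= 2.0 equivalent of the assign-back fix above
temp_dict[product_match][0][0] = pd.concat(
    [temp_dict[product_match][0][0], df], ignore_index=True)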

Combining data tables

I have two data tables, similar to the ones below:
table1
index value
a 6352
a 67
a 43
b 7765
b 53
c 243
c 7
c 543
table 2
index value
a 425
a 6
b 532
b 125
b 89
b 664
c 314
I would like to combine the data into one table, as below, using the index values. The order is important, so the first batch of values under a given index in the combined table must come from table 1:
index value
a 6352
a 67
a 43
a 425
a 6
b 7765
b 53
b 532
b 125
b 89
b 664
c 243
c 7
c 543
c 314
I tried to do it using VBA, but I'm sadly a complete novice and I was wondering if someone has any pointers on how to approach writing the code?
Copy the values of the second table (without the headers) under the values of the first table, select the two resultant columns and sort them by index.
Hope it works!
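For what it's worth, if the two tables are ever loaded into pandas instead of Excel, the same copy-then-sort idea is a two-liner (a sketch; 'index' here is the literal column name from the question, and the stable sort keeps table1's rows first within each index value):

import pandas as pd

# table1 and table2 are DataFrames with columns ['index', 'value']
combined = (pd.concat([table1, table2], ignore_index=True)
              .sort_values('index', kind='stable', ignore_index=True))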
