Sum of rows that match a condition in a dataframe - python-3.x

I have this large dataframe in which I recurrently have duplicated events, one of which has a timestamp of zero:
a b c
139 4E+08 0.234
163 6E+08 0.964
163 0 0.034
172 6E+08 1.173
183 6E+08 0.734
183 0 0.296
and so on.
What I would like to do is add every row that satisfies the condition timestamp = 0 to the previous row, to get something like this:
a b c
139 4E+08 0.234
163 6E+08 0.998
172 6E+08 1.173
183 6E+08 1.030
I looked at various solutions but can't find the proper one... how could I solve this? Thanks.

The other answer assumes that rows which have a 0 value for b will have the same a value as the previous row. Since you have not explicitly stated this, here is an answer that generalizes without that assumption.
It uses pandas.apply to add each 0-value row to the previous row of the main dataframe:
import pandas as pd

data_dict = {'a': [139, 163, 163, 172, 183, 183],
             'b': [4e08, 6e08, 0, 6e08, 6e08, 0],
             'c': [0.234, 0.964, 0.034, 1.173, 0.734, 0.296]}
df = pd.DataFrame(data_dict)

mask = df['b'].eq(0)

def adder(row):
    # add the c value of each 0-row to the row directly above it
    if row.name:
        df.loc[row.name - 1, 'c'] += row['c']

_ = df[mask].apply(adder, axis=1)
df = df[~mask]
which gives us the expected output
df
a b c
0 139 400000000.0 0.234
1 163 600000000.0 0.998
3 172 600000000.0 1.173
4 183 600000000.0 1.030
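A vectorized equivalent (a sketch, not part of the original answer, and starting again from the original df) groups each 0-row with the row above it via a cumulative count of the non-zero rows, then aggregates:
grp = df['b'].ne(0).cumsum()
out = df.groupby(grp).agg({'a': 'first', 'b': 'first', 'c': 'sum'}).reset_index(drop=True)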
Alternatively
Here is another solution (starting again from the original df) using pandas.DataFrame.assign and pandas.DataFrame.loc.
df = df.assign(b_t=df['b'].shift(-1), c_t=df['c'].shift(-1))
mask = df['b_t'].eq(0)
df.loc[mask, 'c'] += df.loc[mask, 'c_t']
df = df[~df['b'].eq(0)].drop(['b_t', 'c_t'], axis=1)
which gives us the same output:
df
a b c
0 139 400000000.0 0.234
1 163 600000000.0 0.998
3 172 600000000.0 1.173
4 183 600000000.0 1.030
Don't want another mask
If you don't want to create another mask, you can shift and reuse the same one (again starting from the original df):
df = df.assign(b_t = df['b'].shift(-1), c_t = df['c'].shift(-1) )
mask = df['b_t'].eq(0)
df.loc[mask, 'c'] += df.loc[mask, 'c_t']
mask = mask.shift(1).fillna(False)
df.mask(mask, inplace=True)
df = df.dropna().drop(['b_t', 'c_t'], axis = 1)
which gives us
a b c
0 139.0 400000000.0 0.234
1 163.0 600000000.0 0.998
3 172.0 600000000.0 1.173
4 183.0 600000000.0 1.030

If I understood correctly what you want to do, you could group by the values in column a and sum the values in each of the other columns, then reset the index to get column a back.
df.groupby('a').sum().reset_index()
Kudos to @ouyang-ze for the suggestion.
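On the sample data above this reproduces the desired result; a quick sketch (as_index=False is equivalent to the reset_index above, and summing b is harmless here because the duplicated rows carry b = 0):
df.groupby('a', as_index=False).sum()
#      a            b      c
# 0  139  400000000.0  0.234
# 1  163  600000000.0  0.998
# 2  172  600000000.0  1.173
# 3  183  600000000.0  1.030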

Related

test/train splits in pycaret using a column for grouping rows that should be in the same split

My dataset contains a column I need to use for splitting by groups, in such a way that rows belonging to the same group are not divided between train and test but are sent as a whole to one of the splits, using PyCaret.
10 row sample for clarification:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
2 23 34 233
2 623 22 888
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
Every unique group_id should be sent in full to one of the splits, like this (using 80/20):
TRAIN SET:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
TEST SET:
group_id measure1 measure2 measure3
2 23 34 233
2 623 22 888
You can try the following, per the documentation:
https://pycaret.readthedocs.io/en/latest/api/classification.html
fold_strategy = "groupkfold"
One solution could look like this:
import numpy as np
import pandas as pd
from itertools import combinations


def is_possible_sum(numbers, n):
    for r in range(len(numbers)):
        for combo in combinations(numbers, r + 1):
            if sum(combo) == n:
                return combo
    print('Desired split not possible')
    raise ArithmeticError


def train_test_split(table: pd.DataFrame, train_fraction: float, col_identifier: str):
    train_ids = []
    occurrences = table[col_identifier].value_counts().to_dict()
    required = sum(occurrences.values()) * train_fraction
    lengths = is_possible_sum(occurrences.values(), required)
    for i in lengths:
        for key, value in occurrences.items():
            if value == i:
                train_ids.append(key)
                del occurrences[key]  # prevents the same ID from being selected twice
                break
    train = table[table[col_identifier].isin(train_ids)]
    test = table[~table[col_identifier].isin(train_ids)]
    return train, test


if __name__ == '__main__':
    df = pd.DataFrame()
    df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    df['Measurement'] = np.random.random(10)
    train_part, test_part = train_test_split(df, 0.8, 'Group_ID')
Some remarks:
This is probably the least elegant way to do it...
It uses an ungodly amount of for loops and is probably slow for larger dataframes. It also doesn't randomize the split.
Much of this is because the mapping from group_id to the number of samples with that group_id cannot simply be inverted, since several groups can have the same size. You could probably do this with numpy arrays as well, but I doubt that the overall structure would be much different.
First function taken from here: How to check if a sum is possible in array?
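For reference (not part of the original answer), scikit-learn's GroupShuffleSplit gives a randomized group-aware split in a few lines; note that train_size here is the fraction of groups, so the row-level split may not be exactly 80/20:
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df['Group_ID']))
train_part, test_part = df.iloc[train_idx], df.iloc[test_idx]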

Find duplicated Rows based on selected columns condition in a pandas Dataframe

I have an extensive database converted into a dataframe in which it is difficult to identify the following manually.
The dataframe has columns named from_bus and to_bus, which identify an element regardless of order; for example, for element 0, L_ABAN_MACA_0_1, the associated pair (109, 140) is the same as (140, 109).
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1       109     140      0.444450
1  L_AGOY_BAÑO_1_1        69      66      0.476683
2  L_AGOY_BAÑO_1_2        69      66      0.476683
3  L_ALAN_INGA_1_1       189     188      0.452790
4  L_ALAN_INGA_1_2       188     189      0.500450
So I want to identify the duplicated pairs and replace them with a single row whose x_ohm_per_km value is the sum of the duplicated values, as follows:
   name             from_bus  to_bus  x_ohm_per_km
0  L_ABAN_MACA_0_1       109     140      0.444450
1  L_AGOY_BAÑO_1_1        69      66      0.953366
3  L_ALAN_INGA_1_1       189     188      0.953240
Let us try groupby on from_bus and to_bus after sorting the values in these columns along axis=1, then agg to aggregate the result, and optionally reindex to conform to the original column order:
import numpy as np

c = ['from_bus', 'to_bus']
df[c] = np.sort(df[c], axis=1)

df.groupby(c, sort=False, as_index=False)\
  .agg({'name': 'first', 'x_ohm_per_km': 'sum'})\
  .reindex(df.columns, axis=1)
Alternative approach:
d = {**dict.fromkeys(df, 'first'), 'x_ohm_per_km': 'sum'}
df.groupby([*np.sort(df[c], axis=1).T], sort=False, as_index=False).agg(d)
name from_bus to_bus x_ohm_per_km
0 L_ABAN_MACA_0_1 109 140 0.444450
1 L_AGOY_BAÑO_1_1 66 69 0.953366
2 L_ALAN_INGA_1_1 188 189 0.953240

filtering and transposing the dataframe in python3

I made a csv file using pandas and am trying to use it as input for the next step. When I open the file using pandas it looks like this example:
Unnamed: 0 Class_Name Probe_Name small_example1.csv small_example2.csv small_example3.csv
0 0 Endogenous CCNO 196 32 18
1 1 Endogenous MYC 962 974 1114
2 2 Endogenous CD79A 390 115 178
3 3 Endogenous FSTL3 67 101 529
4 4 Endogenous VCAN 943 735 9226
I want to make a plot; to do so, I have to change the data structure.
1- I want to remove the Unnamed column.
2- Then I want to make a data frame for a heatmap, using these columns: "Probe_Name", "small_example1.csv", "small_example2.csv" and "small_example3.csv".
3- I also want to transpose the data frame.
Here is the expected output:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
I tried to do that using the following code:
df = pd.read_csv('myfile.csv')
result = df.transpose()
but it does not return what I want. Do you know how to fix it?
df.drop(['Unnamed: 0','Class_Name'],axis=1).set_index('Probe_Name').T
Result:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
Here's a suggestion:
Changes 1 & 2 can be tackled in one go:
df = df.loc[:, ["Probe_Name", "small_example1.csv", "small_example2.csv", "small_example3.csv"]] # This only retains the specified columns
In order for change 3 (transposing) to work as desired, the column Probe_Name needs to be set as your index:
df = df.set_index("Probe_Name", drop=True)
df = df.transpose()

Iterate over rows in a data frame, create a new column, then add more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output I need takes the quantity difference between two consecutive dates, averages it over 24 hours, and creates 23 columns in which each hourly column adds that hourly difference to the previous one, as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate in a loop but it's not working. My code is below:
for i in df.index:
    diff = (df.get_value(i+1, 'Quantity') - df.get_value(i, 'Quantity'))/24
    for j in range(24):
        df[i, [1+j]] = df.[i, [j]]*(1+diff)
I did some research but I have not found how to create columns like above iteratively. I hope you could help me. Thank you in advance.
IIUC, using resample and interpolate, then pivoting the output:
s = df.set_index('Date').resample('1 H').interpolate()
# values=s is treated as the list of s's columns, i.e. ['Quantity']
s = pd.pivot_table(s, index=s.index.date, columns=s.groupby(s.index.date).cumcount(),
                   values=s, aggfunc='mean')
s.columns = s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly, a for-loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:          # the last row has no following row, so its diff is 0
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty)/24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the required columns, e.g.
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...and so on up to Hour-23.
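Rather than writing the 23 assignments by hand, a small loop (a sketch, assuming the diff column created above) builds them all:
for h in range(1, 24):
    df[f'Hour-{h}'] = df['Quantity'] + h * df['diff']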
There are other approaches which will work way better.

pandas df merge avoid duplicate column names

The question: when merging two dfs that both have a column called A, the result is a df with A_x and A_y. I am wondering how to keep A from one df and discard the other, so that I don't have to rename A_x back to A after the merge.
Just filter your dataframe columns before merging.
df1 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(0,100,12),'C':list('ABCD')*3})
df2 = pd.DataFrame({'Key':np.arange(12),'A':np.random.randint(100,1000,12),'C':list('ABCD')*3})
df1.merge(df2[['Key','A']], on='Key')
Output: (Note: C is not duplicated)
A_x C Key A_y
0 60 A 0 440
1 65 B 1 731
2 76 C 2 596
3 67 D 3 580
4 44 A 4 477
5 51 B 5 524
6 7 C 6 572
7 88 D 7 984
8 70 A 8 862
9 13 B 9 158
10 28 C 10 593
11 63 D 11 177
It depends on whether you need to append the columns with duplicated names to the final merged DataFrame:
...if so, add the suffixes parameter to merge:
print(df1.merge(df2, on='Key', suffixes=('', '_')))
...if not, use @Scott Boston's solution.
