I have a dataframe
ID KD DT
0 4 2 5.6
1 4 5 8.7
4 4 8 1.9
5 4 9 1.7
6 4 1 8.8
3 4 3 7.2
9 4 4 3.1
I also have an array of labels, the same size as the number of unique KD values:
L = [0, 0, 0, 1, 1, 1, 1]
which simply indicates that KD == 1 is associated with label 0, KD == 2 with label 0, ..., KD == 9 with label 1, and so on (L is stored in the sorted order of KD).
Now I have two lists, l1 = [1, 2, 5, 9] and l2 = [3, 4, 8]. For each KD value in l2, I want to set its DT to the average of the DT values of the l1 entries that have the same label.
In the example, KD == 3 has the same label (label = 0) as KD == 1 and KD == 2 in l1, so we set DT = (8.8 + 5.6) / 2 = 7.2.
I am currently doing this with a for loop: iterating over l2, finding the l1 entries that have the same label, and then averaging. Is there a way to do this more efficiently, getting rid of the for loop?
My output can be a dictionary of the form
d = {3:7.2, 4: 5.2, 8: 5.2}
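For reference, here is a minimal sketch of the loop-based baseline (assumed, since the question doesn't show the actual loop; the df construction is taken from the example above):

import pandas as pd

df = pd.DataFrame({'ID': [4, 4, 4, 4, 4, 4, 4],
                   'KD': [2, 5, 8, 9, 1, 3, 4],
                   'DT': [5.6, 8.7, 1.9, 1.7, 8.8, 7.2, 3.1]},
                  index=[0, 1, 4, 5, 6, 3, 9])
L = [0, 0, 0, 1, 1, 1, 1]
l1, l2 = [1, 2, 5, 9], [3, 4, 8]

label = dict(zip(sorted(df['KD']), L))  # KD value -> label
dt = dict(zip(df['KD'], df['DT']))      # KD value -> current DT
d = {}
for k2 in l2:
    vals = [dt[k1] for k1 in l1 if label[k1] == label[k2]]
    d[k2] = sum(vals) / len(vals)       # average DT of same-label l1 entries
print(d)  # {3: 7.2, 4: 5.2, 8: 5.2} (up to float rounding)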
IIUC, first set_index with the KD column; then you can select 'DT' and, with where, replace the values that are not isin(l1) with NaN. Then groupby.transform the mapping of the KD index to its group number in L and take the mean. Finally, loc only the KD values that are isin(l2) and use to_dict to get your expected output:
df_ = df.set_index('KD')
print(df_['DT'].where(df_.index.isin(l1))                              # NaN where KD not in l1
      .groupby(df_.index.map(pd.Series(L, df_.index.sort_values())))  # group by label
      .transform('mean')                                              # per-label mean over l1
      .loc[df_.index.isin(l2)]                                        # keep only KD in l2
      .to_dict())
{8: 5.199999999999999, 3: 7.2, 4: 5.199999999999999}
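An equivalent spelling (a sketch under the same setup) computes each label's mean over l1 once and then looks it up for l2, avoiding transform:

label = pd.Series(L, index=df_.index.sort_values())  # KD -> label
sub = df_.loc[df_.index.isin(l1), 'DT']              # DT of the l1 rows only
means = sub.groupby(sub.index.map(label)).mean()     # label -> mean DT over l1
print({k: means[label[k]] for k in l2})
# {3: 7.2, 4: 5.2, 8: 5.2} (up to float rounding)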
If a table has the same index 3 times in a row, I want it to fetch me those rows as a dataframe.
Example:
index var1
1 a
2 b
2 c
2 d
3 e
2 f
5 g
2 f
Expected output after running the code:
index var1
2 b
2 c
2 d
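For reference, a reproducible version of the example (an assumed construction; the answers below all start from this df):

import pandas as pd

df = pd.DataFrame({'index': [1, 2, 2, 2, 3, 2, 5, 2],
                   'var1': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'f']})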
One option is to split the data frame at the positions where the index column changes, check the size of each chunk, filter out the chunks smaller than the threshold, and then recombine the rest:
import pandas as pd
import numpy as np
diff_indices = np.flatnonzero(df['index'].diff().ne(0))
diff_indices
# array([0, 1, 4, 5, 6, 7], dtype=int32)
pd.concat([chunk for chunk in np.split(df, diff_indices) if len(chunk) >= 3])
index var1
1 2 b
2 2 c
3 2 d
Let us identify the blocks of consecutive indices using cumsum, then group and transform with count to find the size of each block, and finally select the rows where the block size is > 2:
b = df['index'].diff().ne(0).cumsum()
df[b.groupby(b).transform('count') > 2]
index var1
1 2 b
2 2 c
3 2 d
You can assign consecutive rows the same value by comparing with the next row and taking the cumsum. Then groupby those consecutive blocks and keep the groups with exactly 3 rows:
m = df['index'].ne(df['index'].shift()).cumsum()
out = df.groupby(m).filter(lambda g: len(g) == 3)
print(out)
index var1
1 2 b
2 2 c
3 2 d
Here's one more solution on top of the ones above (this one is more generalizable, since it selects ALL slices that meet the given criterion):
import pandas as pd

df['diff_index'] = df['index'].diff(-1)          # diff with the next row
df = df.fillna(999)                              # get rid of NaNs
df['diff_index'] = df['diff_index'].astype(int)  # convert the diff to int
df_selected = []                                 # list of the slices we're going to collect
l = list(df['diff_index'])
for i in range(len(l) - 1):
    if l[i] == 0 and l[i + 1] == 0:              # 2 consecutive 0s mean 3 equal values; take the slice
        df_temp = df[df.index.isin([i, i + 1, i + 2])].copy()  # .copy() avoids SettingWithCopyWarning
        del df_temp['diff_index']
        df_selected.append(df_temp)              # append the slice to our list
print(df_selected)  # lists all identified data frames (in your example, there is only one)
[ index var1
1 2 b
2 2 c
3 2 d]
I have this piece of code
import itertools
values = [1, 2, 3, 4]
per = itertools.permutations(values, 2)
hyp = 3
for val in per:
    print(*val)
Output:
1 2
1 3
1 4
2 1
2 3
2 4
3 1
3 2
3 4
4 1
4 2
4 3
I want to compare each tuple with the value of hyp (e.g. 3). If the first value of a tuple is less than or equal to hyp, it keeps it; if the condition isn't met, it discards it.
In this case the tuples (4,1), (4,2), (4,3) should be removed.
In other words, it keeps pairs based on the hyp value. If hyp = 2, then from the values list the output should look like this:
1 2
1 3
1 4
2 1
2 3
2 4
I am not sure whether I explained my problem clearly or not. Let me know if it is unclear.
This will do it. You just need to take index zero of each tuple and compare it to hyp:
import itertools
values = [1, 2, 3, 4]
per = itertools.permutations(values, 2)
hyp = 3
for tup in per:
    if tup[0] <= hyp:
        print(*tup)
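If you want the kept pairs as a list instead of printing them, the same test fits in a comprehension (same values and hyp as above; the permutations are regenerated because the iterator above is already consumed):

kept = [tup for tup in itertools.permutations(values, 2) if tup[0] <= hyp]
print(kept)
# [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3), (2, 4), (3, 1), (3, 2), (3, 4)]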
I have the following dataframe:
A B C
============
11 x 2
11 y 0
13 x -10
13 y 0
10 x 7
10 y 0
and I would like to sort C by absolute value, for the values different from 0. But as I need to keep the A values together, it would look like below (sorted by absolute value but with the 0s in between):
A B C
============
13 x -10
13 y 0
10 x 7
10 y 0
11 x 2
11 y 0
I can't manage to obtain this with sort_values(). If I sort by C, I don't keep the A values together.
Step 1: get absolute values
# creating a column with the absolute values
df["abs_c"] = df["c"].abs()
Step 2: sort values on absolute values of "c"
# sorting by absolute value of "c" & reseting the index & assigning it back to df
df = df.sort_values("abs_c", ascending=False).reset_index(drop=True)
Step 3: get the order of column "a" based on the sorted values. This is achieved with drop_duplicates, which keeps the first instance of each value in column "a", already sorted by abs "c". This will be used in the next step.
# getting the order of "a" based on sorted value of "c"
order_a = df["a"].drop_duplicates()
Step 4: based on the order of "a" and the sorted values of "c", create the data frame
# rebuild the frame group by group, following order_a
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0;
#  iterating the Series directly also avoids a KeyError from order_a[i],
#  since drop_duplicates keeps the original, non-contiguous index)
sorted_df = pd.DataFrame()
for val in order_a:
    sorted_df = pd.concat([sorted_df, df[df["a"] == val]])
Step 5: assign the sorted df back to df
# reset index of sorted values and assigning it back to df
df = sorted_df.reset_index(drop=True)
Output
a b c abs_c
0 13 x -10 10
1 13 y 0 0
2 10 x 7 7
3 10 y 0 0
4 11 x 2 2
5 11 y 0 0
Doc reference
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
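For comparison, a shorter vectorized sketch of the same idea, starting again from the original frame: compute each group's maximum |c| and stably sort the rows by that key, so rows sharing an "a" value stay together (lowercase column names as above):

key = df['c'].abs().groupby(df['a']).transform('max')  # max |c| within each "a" group
df = df.loc[key.sort_values(ascending=False, kind='stable').index].reset_index(drop=True)

The stable sort preserves the original row order within each group, so the 0 rows stay right after their non-zero partners.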
Sorry, it doesn't turn out very nice, but I almost never use pandas. I hope everything works out the way you want it.
import pandas as pd

df = pd.DataFrame({'a': [11, 11, 13, 13, 10, 10],
                   'b': ['x', 'y', 'x', 'y', 'x', 'y'],
                   'c': [2, 0, -10, 0, 7, 0]})
mask = df[df['c'] != 0].copy()  # .copy() avoids SettingWithCopyWarning
mask['abs'] = mask['c'].abs()
mask = mask.sort_values('abs', ascending=False).reset_index(drop=True)
tempNr = 0
for index, row in df.iterrows():
    if row['c'] != 0:
        # overwrite each non-zero row with the next row from the sorted mask
        df.loc[index] = mask.loc[tempNr].drop('abs')
        tempNr = tempNr + 1
print(df)
I have a dataframe in which I need to find a specific image name in the entire dataframe and sum its index values every time it is found. So my data frame looks like:
c                  1                 2                 3                 4
g
0    180731-1-61.jpg   180731-1-61.jpg   180731-1-61.jpg   180731-1-61.jpg
1   1209270004-2.jpg   180609-2-31.jpg  1209270004-2.jpg  1209270004-2.jpg
2   1209270004-1.jpg   180414-2-38.jpg   180707-1-31.jpg  1209050002-1.jpg
3   1708260004-1.jpg  1209270004-2.jpg   180609-2-31.jpg  1209270004-1.jpg
4   1108220001-5.jpg  1209270004-1.jpg  1108220001-5.jpg  1108220001-2.jpg
I need to find 1209270004-2.jpg in the entire dataframe. As it is found at index 1 (three times) and index 3 (once), I want to add the index values of every occurrence, so it should be
1 + 3 + 1 + 1 = 6.
I tried the code:
img_fname = '1209270004-2.jpg'
df2 = df1[df1.eq(img_fname).any(axis=1)]
sum = int(np.sum(df2.index.values))
print(sum)
I am getting a sum of 4, i.e. 1 + 3 = 4, but it should be 6.
If the string occurs only once, twice, three or four times, like e.g. 180707-1-31, which is only in column 3, then the sum should be 45 + 45 + 3 + 45 = 138. That is, if the string is not present in a column, take the value 45 instead of the index value.
You can multiply the boolean mask by the index values and then sum; unlike any(axis=1), this counts every matching cell, not just each matching row once:
img_fname = '1209270004-1.jpg'
s = df1.eq(img_fname).mul(df1.index.to_series(), 0).sum()
print(s)
1 2
2 4
3 0
4 3
dtype: int64
out = np.where(s == 0, 45, s).sum()
print(out)
54
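As a sanity check with the question's own file name: the mask hits indices 1, 3, 1 and 1 across the four columns and no column sum is zero, so the total is the expected 6. (One caveat of this trick: a match at index 0 contributes 0 and is indistinguishable from a miss.)

s = df1.eq('1209270004-2.jpg').mul(df1.index.to_series(), 0).sum()
print(np.where(s == 0, 45, s).sum())
# 6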
If the dataset does not have many columns, this can also work for your original question:
df1 = pd.DataFrame({"A": ["aa", "ab", "cd", "ab", "aa"],
                    "B": ["ab", "ab", "ab", "aa", "ab"]})
s = 0
for i in df1.columns:
    s = s + sum(df1.index[df1.loc[:, i] == "ab"].tolist())
Input:
A B
0 aa ab
1 ab ab
2 cd ab
3 ab aa
4 aa ab
Output: 11
Given the following data frame:
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 4, 4],
                   'B': [0, 4, 4, 0],
                   'C': [0, 4, 4, 4],
                   'D': [4, 0, 0, 4],
                   'E': [4, 0, 0, 0],
                   'Name': ['a', 'a', 'b', 'c']})
df
A B C D E Name
0 0 0 0 4 4 a
1 4 4 4 0 0 a
2 4 4 4 0 0 b
3 4 0 4 4 0 c
I'd like to add a new field called "Match_Flag" which labels unique combinations of rows if they have complementary zero patterns (as with rows 0, 1, and 2) AND have the same Name (true only for rows 0 and 1). It uses the Name of the rows that match.
The desired result is as follows:
A B C D E Name Match_Flag
0 0 0 0 4 4 a a
1 4 4 4 0 0 a a
2 4 4 4 0 0 b NaN
3 4 0 4 4 0 c NaN
Caveat:
The patterns may vary, but should still be complementary.
Thanks in advance!
UPDATE
Sorry for the confusion.
Here is some clarification:
The reason why rows 0 and 1 are "complementary" is that they have opposite patterns of zeros in their columns: 0,0,0,4,4 vs. 4,4,4,0,0.
The number 4 is arbitrary; it could just as easily be 0,0,0,4,2 and 65,770,23,0,0. So if two such rows are indeed complementary and they have the same name, I'd like them to be flagged with that same name under the "Match_Flag" column.
You can identify a complement if the dot product of two rows is zero and their element-wise sum is nowhere zero.
import numpy as np

def complements(df):
    v = df.drop('Name', axis=1).values
    n = v.shape[0]
    row, col = np.triu_indices(n, 1)
    # ensure two rows are complete:
    # their sum contains no zeros
    c = ((v[row] + v[col]) != 0).all(1)
    complete = set(row[c]).union(col[c])
    # ensure two rows do not overlap:
    # their product is zero everywhere
    o = (v[row] * v[col] == 0).all(1)
    non_overlap = set(row[o]).union(col[o])
    # a row is part of a complement iff it does
    # not overlap and it is complete
    complement = list(non_overlap.intersection(complete))
    # return the matching slice of names
    return df.Name.iloc[complement]
Then groupby('Name') and apply our function:
df['Match_Flag'] = df.groupby('Name', group_keys=False).apply(complements)
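With the example frame above, this reproduces the expected output: only the complementary pair sharing Name 'a' is flagged, and the unmatched rows are left as NaN:

print(df)
#    A  B  C  D  E Name Match_Flag
# 0  0  0  0  4  4    a          a
# 1  4  4  4  0  0    a          a
# 2  4  4  4  0  0    b        NaN
# 3  4  0  4  4  0    c        NaN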