Locate dataframe rows where values are outside bounds specified for each column - python-3.x

I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
          [10, 1000, 1, 1000, 50])
# Example df
    a    b    c    d   e
0   5    3  0.3   17  12
1  12   50  0.5    2  31
2   9  982  0.2  321  21
3   1    3  1.2   92  48
# Expected output with bounds given above
    a    b    c    d   e
0   5    3  0.3   17  12
2   9  982  0.2  321  21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad but still seems less than ideal):
for i, col in enumerate(df.columns):
    # keep only the rows whose value in this column lies within its bounds
    df = df[(bounds[0][i] < df[col]) & (df[col] < bounds[1][i])]
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?

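One way, using pandas.DataFrame.apply with pandas.Series.between (inclusive on both ends by default):
# Map each column name to its (lower, upper) pair
bounds = dict(zip(df.columns, zip(*bounds)))
# Drop every row in which any column falls outside its bounds
new_df = df[~df.apply(lambda x: ~x.between(*bounds[x.name])).any(axis=1)]
print(new_df)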
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
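The same filter can also be written without apply, by comparing the whole frame against the bounds at once. The following is a minimal sketch, assuming the bounds are listed in the same order as the dataframe columns; the names lower, upper and in_bounds are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

bounds = ([0.1, 1, 0.1, 5, 10],
          [10, 1000, 1, 1000, 50])
df = pd.DataFrame({
    "a": [5, 12, 9, 1],
    "b": [3, 50, 982, 3],
    "c": [0.3, 0.5, 0.2, 1.2],
    "d": [17, 2, 321, 92],
    "e": [12, 31, 21, 48],
})

# Split the (2, k) bounds into one lower and one upper value per column
lower, upper = np.asarray(bounds)

# Compare each column against its own bound (axis=1 matches values to columns)
in_bounds = df.ge(lower, axis=1) & df.le(upper, axis=1)

# Keep only the rows where every column is within bounds
new_df = df[in_bounds.all(axis=1)]
print(new_df)
```

Like between, this treats both bounds as inclusive; swapping ge/le for gt/lt would make them strict.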

Related

Is there a way to sort a list so that rows with the same value in one column are evenly distributed?

Hoping to sort (below left) by sector but distribute evenly (below right):
Name  Sector.    Name.  Sector
A     1          A      1
B     1          E      2
C     1          H      3
D     4          D      4
E     2          B      1
F     2          F      2
G     2          J      3
H     3          I      4
I     4          C      1
J     3          G      2
Real data is 70+ rows with 4 sectors.
I've worked around it manually but would love to figure out how to do it with a formula in excel.
Here's a more complete (and hopefully more accurate) idea - the carouselOrder is the column I'd like to generate via a formula.
guestID  guestSector  carouselOrder
1        1            1
2        1            5
3        1            9
4        1            13
5        2            2
6        2            6
7        2            10
8        2            14
9        3            3
10       3            7
11       3            11
12       2            18
13       1            17
14       1            20
15       1            23
16       2            21
17       2            24
18       2            27
19       1            26
20       1            29
21       1            30
22       1            31
23       3            15
24       3            19
25       3            22
26       3            25
27       3            28
28       1            32
29       4            4
30       4            8
31       4            12
32       4            16
When using Office 365, you can use the following in D2: =MOD(SEQUENCE(COUNTA(A2:A11),,0),4)+1
This creates a repeating counter of the sectors 1 to 4, with one value for each row of your data.
In C2 use the following:
=BYROW(D2#,LAMBDA(x,
INDEX(
FILTER($A$2:$A$11,$B$2:$B$11=x),
SUM(--(D$2:x=x)))))
This filters the names whose sector equals the sector in the given row of D2#, then uses INDEX to pick the n-th match, where n is the number of times that sector has appeared in D2# up to the current row.
Let's try the following approach, which doesn't require a helper column. I will first explain the logic used to build the recurrence, then the Excel formula that implements it.
If we sort the input data Name and Sector. by Sector. in ascending order, the new positions of the Name values (letters) can be calculated as follows (Table 1):
Name  Sector.Sorted  Position
A     1              1+4*0=1
B     1              1+4*1=5
C     1              1+4*2=9
E     2              2+4*0=2
F     2              2+4*1=6
G     2              2+4*2=10
H     3              3+4*0=3
J     3              3+4*1=7
D     4              4+4*0=4
I     4              4+4*1=8
The new positions of Name (letters) follow this pattern (Formula 1):
position = Sector.Sorted + groupSize * factor
where groupSize is 4 in our case and factor counts how many times the same Sector.Sorted value has already appeared, starting from 0. Think of Sector.Sorted as groups, where each run of repeated values represents a group: 1, 2, 3 and 4.
Once we can build the Position values, we can sort Name by the new positions with the SORTBY(array, by_array1) function. Check the SORTBY documentation for more information on how this function works.
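For readers who want to check Formula 1 outside Excel, here is a minimal pandas sketch of the same idea. It is an analogue, not part of the Excel answer: the dataframe, the column name Sector and the variable names are assumptions made for illustration, and groupby(...).cumcount() plays the role of factor.

```python
import pandas as pd

# Input data in its original (unsorted) order
df = pd.DataFrame({
    "Name": list("ABCDEFGHIJ"),
    "Sector": [1, 1, 1, 4, 2, 2, 2, 3, 4, 3],
})

group_size = 4  # number of distinct sectors

# factor: 0 for the first row of a sector, 1 for the second, and so on
factor = df.groupby("Sector").cumcount()

# Formula 1: position = sector + groupSize * factor
pos = df["Sector"] + group_size * factor

# Sorting by position interleaves the sectors
result = df.assign(pos=pos).sort_values("pos")
print(result)
```

With the sample data this yields the order A, E, H, D, B, F, J, I, C, G, which matches the evenly distributed list in the question.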
Here is the formula to get the Name sorted in cell E2:
=LET(groupSize, 4, sorted, SORT(A2:B11,2), sName,
INDEX(sorted,,1),sSector, INDEX(sorted,,2),
seq0, SEQUENCE(ROWS(sSector),,0), mapResult,
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW")))), factor,
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0))),
pos,MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sName,pos)
)
Here is the output:
Explanation
The name sorted represents the input data sorted by Sector. in ascending order, i.e.: SORT(A2:B11,2). The names sName and sSector represent each column of sorted.
To identify each group we need the following sequence (seq0) starting from 0, i.e. SEQUENCE(ROWS(sSector),,0).
Now we need to identify when a new group starts. We use MAP function for that and the result is represented by the name mapResult:
MAP(sSector, seq0, LAMBDA(a,b, IF(b=0, "SAME",
IF(a=INDEX(sSector,b), "SAME", "NEW"))))
The logic is as follows: if we are at the beginning of the sequence (the first value of seq0), it returns SAME; otherwise we check the current value of sSector (a) against the previous one, given by INDEX(sSector,b). If they are equal we are still in the same group; otherwise a new group has started.
The intermediate result of mapResult is:
Name  Sector Sorted  mapResult
A     1              SAME
B     1              SAME
C     1              SAME
E     2              NEW
F     2              SAME
G     2              SAME
H     3              NEW
J     3              SAME
D     4              NEW
I     4              SAME
The first two columns are shown just for illustrative purpose, but mapResult only returns the last column.
Now we just need to create the counter based on every time we find NEW. In order to do that we use SCAN function and the result is stored under the name factor. This value represents the factor we use to multiply by 4 within each group (see Table 1):
SCAN(-1,mapResult, LAMBDA(aa,c,IF(c="SAME", aa+1,0)))
The accumulator starts at -1 because the counter must begin at 0. Every time SCAN finds SAME, it adds 1 to the previous value. When it finds NEW (anything other than SAME), the accumulator is reset to 0.
Here is the intermediate result of factor:
Name  Sector Sorted  mapResult  factor
A     1              SAME       0
B     1              SAME       1
C     1              SAME       2
E     2              NEW        0
F     2              SAME       1
G     2              SAME       2
H     3              NEW        0
J     3              SAME       1
D     4              NEW        0
I     4              SAME       1
The first three columns are shown for illustrative purpose.
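The running counter that SCAN produces can be reproduced with a plain accumulation. This is a hedged Python sketch rather than anything from the Excel answer: itertools.accumulate stands in for SCAN, and the list map_result is typed in by hand from the table above.

```python
from itertools import accumulate

# mapResult column from the table above
map_result = ["SAME", "SAME", "SAME", "NEW", "SAME", "SAME",
              "NEW", "SAME", "NEW", "SAME"]

# Start at -1; add 1 on SAME, reset to 0 on NEW (same rule as the LAMBDA in SCAN)
running = accumulate(
    map_result,
    lambda acc, flag: acc + 1 if flag == "SAME" else 0,
    initial=-1,
)

# accumulate also yields the initial value, which SCAN does not return
factor = list(running)[1:]
print(factor)
```

The printed list should agree with the factor column shown above.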
Now we have all the elements to build our pattern for the new positions represented with the name pos:
MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n))
where m represents each element of Sector.Sorted and n the corresponding previously calculated factor value. As you can see, the Excel formula implements the generic formula (Formula 1 above). The intermediate result will be:
Name  Sector Sorted  mapResult  factor  pos
A     1              SAME       0       1
B     1              SAME       1       5
C     1              SAME       2       9
E     2              NEW        0       2
F     2              SAME       1       6
G     2              SAME       2       10
H     3              NEW        0       3
J     3              SAME       1       7
D     4              NEW        0       4
I     4              SAME       1       8
The previous columns are shown just for illustrative purpose. Now we have the new positions, so we are ready to sort based on the new positions for Name via:
SORTBY(sName,pos)
Update
The first MAP can be removed by passing SCAN an array that carries both the sSector value and the index position used to find the previous element. SCAN accepts only a single array as its input argument, so we combine both pieces of information into one array. This formula can be used instead:
=LET(groupSize, 4, sorted, SORT(A2:B11,2), sName,
INDEX(sorted,,1),sSector, INDEX(sorted,,2),
factor, SCAN(-1,sSector&"-"&SEQUENCE(ROWS(sSector),,0),
LAMBDA(aa,b, LET(s, TEXTSPLIT(b,"-"),item, INDEX(s,,1),
idx, INDEX(s,,2), IF(aa=-1, 0, IF(1*item=INDEX(sSector, idx), aa+1,0))))),
pos,MAP(sSector, factor, LAMBDA(m,n, m + groupSize*n)),
SORTBY(sName,pos)
)
Inside SCAN we use a LET function to compute the elements needed for the comparison in the corresponding LAMBDA. We extract item and the idx position used to find the previous element of sSector, so that with:
1*item=INDEX(sSector, idx)
we can compare each element of sSector with the previous one, starting from the second element of sSector. We multiply item by 1 because TEXTSPLIT returns text; without the conversion to a number the comparison would fail.

Python for-loop to change row value based on a condition works correctly but does not change the values on pandas dataframe?

I am just getting into Python, and I am trying to write a for-loop that goes over every row and, based on a given condition, randomly selects two columns and changes their values. The for-loop runs without any problems; however, the values in the dataframe do not change.
A reproducible example:
import pandas as pd

df = pd.DataFrame({'A': [10,40,10,20,10],
                   'B': [10,10,50,40,50],
                   'C': [10,20,10,10,10],
                   'D': [10,30,10,10,50],
                   'E': [10,10,40,10,10],
                   'F': [2,3,2,2,3]})
df:
A B C D E F
0 10 10 10 10 10 2
1 40 10 20 30 10 3
2 10 50 10 10 40 2
3 20 40 10 10 10 2
4 10 50 10 50 10 3
This is my for-loop. It iterates over all rows and checks whether the value in column F equals 2; if so, it randomly selects two columns with value 10 and adds 100 to them.
for index, i in df.iterrows():
    if i['F'] == 2:
        i[i==10].sample(2, axis=0)+100
        print(i[i==10].sample(2, axis=0)+100)
This is the output of the loop:
E 110
C 110
Name: 0, dtype: int64
C 110
D 110
Name: 2, dtype: int64
C 110
D 110
Name: 3, dtype: int64
This is what the dataframe is expected to look like:
df:
A B C D E F
0 10 10 110 10 110 2
1 40 10 20 30 10 3
2 10 50 110 110 40 2
3 20 40 110 110 10 2
4 10 50 10 50 10 3
However, the columns in the dataframe are not changed. Any idea what's going wrong?
This line:
i[i==10].sample(2, axis=0)+100
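.sample returns a new object, and the result of adding 100 to it is never assigned to anything, so the original dataframe (df) is not updated at all.
Try this:
for index, i in df.iterrows():
    if i['F'] == 2:
        cond = (i == 10)
        # You can only sample 2 columns if at least
        # 2 columns meet the condition
        if cond.sum() >= 2:
            idx = i[cond].sample(2).index
            # Write through df.loc: changes made to the row object `i`
            # are not guaranteed to reach df
            df.loc[index, idx] += 100
            print(df.loc[index, idx])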
You should not modify the original df in place. Make a copy and iterate:
df2 = df.copy()
for index, i in df.iterrows():
    if i['F'] == 2:
        s = i[i==10].sample(2, axis=0)+100
        df2.loc[index,i.index.isin(s.index)] = s
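As a variation on the answers above, the loop can skip iterrows and go only over the index labels of the rows where F equals 2, writing each change back with .loc. This is only a sketch under a few assumptions: the names out and value_cols are made up for illustration, F itself is left out of the candidate columns, and random_state=0 is there only so that repeated runs pick the same columns.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 40, 10, 20, 10],
                   'B': [10, 10, 50, 40, 50],
                   'C': [10, 20, 10, 10, 10],
                   'D': [10, 30, 10, 10, 50],
                   'E': [10, 10, 40, 10, 10],
                   'F': [2, 3, 2, 2, 3]})

out = df.copy()                          # leave the original frame untouched
value_cols = ['A', 'B', 'C', 'D', 'E']   # columns that may be changed

for idx in out.index[out['F'] == 2]:
    row = out.loc[idx, value_cols]
    tens = row[row == 10]
    # sampling 2 columns requires at least 2 candidates
    if len(tens) >= 2:
        chosen = tens.sample(2, random_state=0).index
        out.loc[idx, chosen] += 100      # write directly into the frame

print(out)
```

Every change goes through out.loc, so it does not depend on whether a row object is a view or a copy of the underlying data.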

How to replace rows with character value by integers in a column in pandas dataframe?

I am working on a large dataset. Some columns should contain only integer values, but because the dataset has not been cleaned, a few rows contain strings with characters mixed in among the integers. Here I illustrate the problem with a small pandas dataframe example.
I have the following dataframe:
Index  l1  l2    l3
0      1   123   23
1      2   Z3V   343
2      3   321   21
3      4   AZ34  345
4      5   432   3
With dataframe code :
l1,l2,l3 = [1,2,3,4,5], [123, 'Z3V', 321, 'AZ34', 432], [23,343,21,345,3]
data = pd.DataFrame(zip(l1,l2,l3), columns=['l1', 'l2', 'l3'])
print(data)
As you can see, column 'l2' contains characters mixed with integers at row indexes 1 and 3. I want to find such rows in this particular column and print them. Later I want to replace them with integer values such as 100. Each distinct string gets its own replacement: for example, I replace instances of 'Z3V' with 100 and instances of 'AZ34' with 101. The point is to replace the values that contain characters with integers. If 'Z3V' occurs again in the 'l2' column, it should again be replaced with 100.
Expected output :
Index  l1  l2   l3
0      1   123  23
1      2   100  343
2      3   321  21
3      4   101  345
4      5   432  3
As you can see, the two instances where there were characters have been replaced with 100 and 101 respectively.
How can I get this expected output?
You could do:
import pandas as pd
import numpy as np
# setup
l1, l2, l3 = [1, 2, 3, 4, 5, 6], [123, 'Z3V', 321, 'AZ34', 432, 'Z3V'], [23, 343, 21, 345, 3, 3]
data = pd.DataFrame(zip(l1, l2, l3), columns=['l1', 'l2', 'l3'])
# find all non numeric values across the whole DataFrame
mask = data.applymap(np.isreal)
rows, cols = np.where(~mask)
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
# apply the replacements
res = data.replace(replacements)
print(res)
Output
l1 l2 l3
0 1 123 23
1 2 101 343
2 3 321 21
3 4 100 345
4 5 432 3
5 6 101 3
Note that I added an extra row to verify the desired behaviour, so the data DataFrame now looks like:
l1 l2 l3
0 1 123 23
1 2 Z3V 343
2 3 321 21
3 4 AZ34 345
4 5 432 3
5 6 Z3V 3
By changing this line:
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
you can change the replacement values as you see fit.
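If the codes should follow the order in which the strings first appear (so that 'Z3V' becomes 100 and 'AZ34' becomes 101, as in the question), one possible variation is sketched below. It is restricted to column l2 and swaps in pd.to_numeric with errors="coerce" to detect the non-numeric entries; the names bad, mapping and cleaned are assumptions made for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "l1": [1, 2, 3, 4, 5, 6],
    "l2": [123, "Z3V", 321, "AZ34", 432, "Z3V"],
    "l3": [23, 343, 21, 345, 3, 3],
})

# Entries that cannot be parsed as numbers become NaN
bad = pd.to_numeric(data["l2"], errors="coerce").isna()
print(data[bad])  # the rows that contain characters

# Number the distinct strings in order of first appearance, starting at 100
mapping = {value: 100 + n for n, value in enumerate(data.loc[bad, "l2"].unique())}

# Replace the strings and convert the column to integers
cleaned = data.copy()
cleaned["l2"] = cleaned["l2"].map(lambda v: mapping.get(v, v)).astype(int)
print(cleaned)
```

Because the mapping is built from the distinct values, every repeat of the same string receives the same integer.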

loops application in dataframe to find output

I have the following data:
dict={'A':[1,2,3,4,5],'B':[10,20,233,29,2],'C':[10,20,3040,230,238]...................}
and
df = pd.DataFrame(dict)
In this manner I have 20 columns with 5 numerical entries in each column.
I want a new column whose values follow this logic:
0 A[0]*B[0] + A[0]*C[0] + A[0]*D[0].......
1 A[1]*B[1] + A[1]*C[1] + A[1]*D[1].......
2 A[2]*B[2] + A[2]*C[2] + A[2]*D[2].......
I tried it in the following manner, but I cannot type out 20 columns by hand, so I would like to know how to use a loop to get the desired output:
lst=[]
for i in range(0,5):
    j=df.A[i]*df.B[i]+ df.A[i]*df.C[i]+.......
    lst.append(j)
    i=i+1
A potential solution is the following. I am only using the example you posted, but it works fine for more columns. Your data is df
A B C
0 1 10 10
1 2 20 20
2 3 233 3040
3 4 29 230
4 5 2 238
You can create a new column, D, by first selecting every column of your dataframe except A
add = df.loc[:, df.columns != 'A']
and then multiplying column A by the row-wise sum of the columns in add, which equals the sum of all the individual products because A is a common factor:
df['D'] = df['A']*add.sum(axis=1)
which returns
A B C D
0 1 10 10 20
1 2 20 20 80
2 3 233 3040 9819
3 4 29 230 1036
4 5 2 238 1200
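To stay closer to the term-by-term logic in the question, each column can also be multiplied by A before summing. This is a small sketch using only the three example columns; the name others is an assumption introduced here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [10, 20, 233, 29, 2],
                   "C": [10, 20, 3040, 230, 238]})

# Every column except A
others = df.drop(columns="A")

# A*B, A*C, ... computed column by column, then added up per row
df["D"] = others.mul(df["A"], axis=0).sum(axis=1)
print(df)
```

Both versions give the same numbers; factoring A out simply avoids building the intermediate table of products.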

Binning with pd.Cut Beyond range(replacing Nan with "<min_val" or ">Max_val" )

df= pd.DataFrame({'days': [0,31,45,35,19,70,80 ]})
df['range'] = pd.cut(df.days, [0,30,60])
df
The code above reproduces the problem: pd.cut converts a numerical column to a categorical column, using the bin edges passed in the list [0,30,60]. Rows 0, 5 and 6 are categorized as NaN because their values fall outside those bins. What I want is for 0 to be categorized as <0, and for 70 and 80 to be categorized as >60. If possible, I would also like dynamic text labels A, B, C, D, E, depending on the number of categories created.
For the first part, adding -np.inf and np.inf to the bins will ensure that everything gets a bin:
In [5]: df= pd.DataFrame({'days': [0,31,45,35,19,70,80]})
...: df['range'] = pd.cut(df.days, [-np.inf, 0, 30, 60, np.inf])
...: df
...:
Out[5]:
days range
0 0 (-inf, 0.0]
1 31 (30.0, 60.0]
2 45 (30.0, 60.0]
3 35 (30.0, 60.0]
4 19 (0.0, 30.0]
5 70 (60.0, inf]
6 80 (60.0, inf]
For the second, you can use .cat.codes to get the bin index and do some tweaking from there:
In [8]: df['range'].cat.codes.apply(lambda x: chr(x + ord('A')))
Out[8]:
0 A
1 C
2 C
3 C
4 B
5 D
6 D
dtype: object
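Text labels can also be passed to pd.cut directly through its labels argument. The sketch below builds the labels from the bin edges; the label wording (for example "<=0" for the lowest bin, since that bin includes its right edge) and the names edges, labels and letters are assumptions chosen for illustration.

```python
import string

import numpy as np
import pandas as pd

df = pd.DataFrame({"days": [0, 31, 45, 35, 19, 70, 80]})

edges = [0, 30, 60]
bins = [-np.inf, *edges, np.inf]

# One label per bin: below the first edge, between edges, above the last edge
labels = ([f"<={edges[0]}"]
          + [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]
          + [f">{edges[-1]}"])
df["range"] = pd.cut(df["days"], bins=bins, labels=labels)

# Letter labels A, B, C, ... sized to the number of bins
letters = list(string.ascii_uppercase[:len(labels)])
df["letter"] = pd.cut(df["days"], bins=bins, labels=letters)
print(df)
```

Since the labels are derived from edges, changing the list of edges changes the number of categories and their names together.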
