Given this code
bins = pd.IntervalIndex.from_tuples([(0, 2), (2,3), (3,6)])
df['bin']=pd.cut(df.num, bins, labels=False)
The result is
num bin
0 1 (0, 2]
1 2 (0, 2]
2 3 (2, 3]
3 4 (3, 6]
4 5 (3, 6]
5 6 (3, 6]
but I would like the result to be
num bin
0 1 1
1 2 1
2 3 2
3 4 3
4 5 3
5 6 3
i.e., use an integer to represent each bin range. How can I achieve this?

It turns out the category codes will do: df['bin_num'] = df['bin'].cat.codes + 1 (the codes start at 0, hence the + 1).
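A minimal, self-contained sketch of the integer-label approach, assuming the sample `num` column shown in the question. When `bins` is an `IntervalIndex`, `pd.cut` ignores the `labels` argument and returns an interval-valued categorical, so the integer has to be derived from the category codes afterwards.

```python
import pandas as pd

# Sample data taken from the question's output table
df = pd.DataFrame({"num": [1, 2, 3, 4, 5, 6]})

bins = pd.IntervalIndex.from_tuples([(0, 2), (2, 3), (3, 6)])

# labels= is ignored for IntervalIndex bins, so "bin" holds intervals
df["bin"] = pd.cut(df["num"], bins)

# Category codes are 0-based positions into `bins`; shift to 1-based
df["bin_num"] = df["bin"].cat.codes + 1

print(df)
```

Values that fall outside every interval get code -1, so they would show up as 0 after the shift.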


New DataFrame column that contains IDs where value is outside bounds?

I have the following DataFrame :
import pandas as pd

data: dict[str, list[int]] = {
    "x1": [5, 6, 7, 8, 9],
    "min1": [3, 3, 3, 3, 3],
    "max1": [8, 8, 8, 8, 8],
    "x2": [0, 1, 2, 3, 4],
    "min2": [2, 2, 2, 2, 2],
    "max2": [7, 7, 7, 7, 7],
    "x3": [7, 6, 7, 6, 7],
    "min3": [1, 1, 1, 1, 1],
    "max3": [6, 6, 6, 6, 6],
}
n: int = 3  # number of xi
df: pd.DataFrame = pd.DataFrame(data=data)
x1 min1 max1 x2 min2 max2 x3 min3 max3
0 5 3 8 0 2 7 7 1 6
1 6 3 8 1 2 7 6 1 6
2 7 3 8 2 2 7 7 1 6
3 8 3 8 3 2 7 6 1 6
4 9 3 8 4 2 7 7 1 6
I would like to add a new column alert to df that contains the IDs i where xi < mini or xi > maxi.
Expected result
x1 min1 max1 x2 min2 max2 x3 min3 max3 alert
0 5 3 8 0 2 7 7 1 6 "2,3"
1 6 3 8 1 2 7 6 1 6 "2"
2 7 3 8 2 2 7 7 1 6 "3"
3 8 3 8 3 2 7 6 1 6 ""
4 9 3 8 4 2 7 7 1 6 "1,3"
I looked at this answer but could not understand how to apply it to my problem.
Below is my working implementation that I wish to improve.
def f(row: pd.Series) -> str:
    alert: str = ""
    for k in range(1, n+1):
        if row[f"x{k}"] < row[f"min{k}"] or row[f"x{k}"] > row[f"max{k}"]:
            alert += f"{k}"
    return ",".join(list(alert))
df["alert"] = df.apply(f, axis=1)
Actually, given that your output is a string, your approach isn't too bad. I would just suggest making alert a list, not a string:
def f(row: pd.Series) -> str:
    alert: list = []
    for k in range(1, n+1):
        if row[f"x{k}"] < row[f"min{k}"] or row[f"x{k}"] > row[f"max{k}"]:
            alert.append(str(k))
    return ",".join(alert)
In a slightly fancier way, you can use a matrix product:
xs = df.filter(regex='^x')
mins = df.filter(like='min').to_numpy()
maxes = df.filter(like='max').to_numpy()
mask = (xs < mins) | (xs > maxes)
df['alert'] = (mask @ xs.columns.str.replace('x', ',')).str.replace('^,', '', regex=True)
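To see why a matrix product between a boolean mask and string labels yields the joined IDs, here is a small NumPy-only sketch. The `mask` and `labels` arrays are hypothetical stand-ins for the first and last rows of the question's data. It relies on two facts: `True * ",2"` is `",2"` while `False * ",2"` is the empty string, and a dot product over object arrays sums the products with `+`, which concatenates strings.

```python
import numpy as np

# Hypothetical mask: True where x_i is out of bounds (two rows, three IDs)
mask = np.array([[False, True, True],
                 [True, False, True]])

# One label per ID, each prefixed with a comma (object dtype keeps Python str)
labels = np.array([",1", ",2", ",3"], dtype=object)

# Multiplying a string by a bool repeats it once or zero times
print(repr(True * ",2"), repr(False * ",2"))

# Object-dtype dot product: multiply element-wise, then add (concatenate)
joined = mask.dot(labels)

# Drop the leading comma left over from the first selected label
alerts = [s.lstrip(",") for s in joined]
print(alerts)
```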
We can group the DataFrame's columns by the integer suffix in each name:
# note: groupby(..., axis=1) is deprecated in pandas 2.1 and later
df['alert'] = (df.groupby(df.columns.str.extract(r'(\d+)$')[0].tolist(), axis=1)
                 .apply(lambda g: g[f'x{g.name}'].lt(g[f'min{g.name}']) | g[f'x{g.name}'].gt(g[f'max{g.name}']))
                 .apply(lambda row: ','.join(row.index[row]), axis=1))
   x1  min1  max1  x2  min2  max2  x3  min3  max3 alert
0   5     3     8   0     2     7   7     1     6   2,3
1   6     3     8   1     2     7   6     1     6     2
2   7     3     8   2     2     7   7     1     6     3
3   8     3     8   3     2     7   6     1     6
4   9     3     8   4     2     7   7     1     6   1,3
Intermediate result
(df.groupby(df.columns.str.extract(r'(\d+)$')[0].tolist(), axis=1)
   .apply(lambda g: g[f'x{g.name}'].lt(g[f'min{g.name}']) | g[f'x{g.name}'].gt(g[f'max{g.name}'])))
       1      2      3
0  False   True   True
1  False   True  False
2  False  False   True
3  False  False  False
4   True  False   True
Using pandas.wide_to_long:
a = (pd.wide_to_long(df.reset_index(), ['x', 'min', 'max'], 'index', 'alert')
       .loc[lambda x: x['x'].lt(x['min']) | x['x'].gt(x['max'])]
       .reset_index()  # turn the 'index' and 'alert' levels back into columns
       .groupby('index')['alert'].agg(lambda x: ','.join(x.astype(str))))
df.join(a)
   x1  min1  max1  x2  min2  max2  x3  min3  max3 alert
0   5     3     8   0     2     7   7     1     6   2,3
1   6     3     8   1     2     7   6     1     6     2
2   7     3     8   2     2     7   7     1     6     3
3   8     3     8   3     2     7   6     1     6   NaN
4   9     3     8   4     2     7   7     1     6   1,3

Creating dataframe with multi level column index from four 2d numpy arrays

I have four 2d numpy arrays:
import numpy as np
import pandas as pd
x1 = np.array([[2, 4, 1],
[2, 2, 1],
[1, 3, 3],
[2, 2, 1],
[3, 3, 2]])
x2 = np.array([[1, 2, 2],
[4, 1, 4],
[1, 4, 4],
[3, 3, 2],
[2, 2, 4]])
x3 = np.array([[4, 3, 2],
[4, 3, 2],
[4, 3, 3],
[1, 2, 2],
[1, 4, 3]])
x4 = np.array([[3, 1, 1],
[3, 4, 3],
[2, 2, 1],
[2, 1, 1],
[1, 2, 4]])
And I would like to create a dataframe as following:
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])
df = pd.DataFrame(np.concatenate((x1,x2,x3,x4),axis=1), columns=header)
df.index.name = 'Time'
Data in this DataFrame is not in the desired form.
I want the four columns (x1, x2, x3, x4) under the first top-level label (location1) to be built from the first column of each numpy array. The next four columns, under the second top-level label (location2), should be built from the second column of each array, and so on. The number of top-level labels, len(level_1_label), equals the number of columns in each of the four 2d numpy arrays.
Desired DataFrame:
One option is to reverse the order in creating the MultiIndex column (since level_1_label corresponds to the columns and level_2_label corresponds to the arrays); then swaplevel + sort_index (to get it in the desired order) after building the DataFrame:
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_2_label, level_1_label], names=['Variable','Location'])
df = pd.DataFrame(np.concatenate((x1,x2,x3,x4),axis=1), columns=header).swaplevel(axis=1).sort_index(level=0, axis=1)
df.index.name = 'Time'
Location location1 location2 location3
Variable x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4
0 2 1 4 3 4 2 3 1 1 2 2 1
1 2 4 4 3 2 1 3 4 1 4 2 3
2 1 1 4 2 3 4 3 2 3 4 3 1
3 2 3 1 2 2 3 2 1 1 2 2 1
4 3 2 1 1 3 2 4 2 2 4 3 4
One option is to reshape the data in Fortran order, before creating the dataframe:
# reusing your code
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])
# np.vstack is a convenience wrapper around np.concatenate with axis=0
outcome = np.reshape(np.vstack([x1,x2,x3,x4]), (len(x1), -1), order = 'F')
df = pd.DataFrame(outcome, columns = header)
df.index.name = 'Time'
Location location1 location2 location3
Variable x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4
0 2 1 4 3 4 2 3 1 1 2 2 1
1 2 4 4 3 2 1 3 4 1 4 2 3
2 1 1 4 2 3 4 3 2 3 4 3 1
3 2 3 1 2 2 3 2 1 1 2 2 1
4 3 2 1 1 3 2 4 2 2 4 3 4
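As a further sketch (not taken from either answer above), the same interleaving can be obtained by stacking the arrays along a new last axis with `np.stack` and flattening the last two axes. The arrays and labels are the ones from the question; the names `stacked` and `flat` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

x1 = np.array([[2, 4, 1], [2, 2, 1], [1, 3, 3], [2, 2, 1], [3, 3, 2]])
x2 = np.array([[1, 2, 2], [4, 1, 4], [1, 4, 4], [3, 3, 2], [2, 2, 4]])
x3 = np.array([[4, 3, 2], [4, 3, 2], [4, 3, 3], [1, 2, 2], [1, 4, 3]])
x4 = np.array([[3, 1, 1], [3, 4, 3], [2, 2, 1], [2, 1, 1], [1, 2, 4]])

level_1_label = ['location1', 'location2', 'location3']
level_2_label = ['x1', 'x2', 'x3', 'x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label],
                                    names=['Location', 'Variable'])

# Shape (rows, locations, variables): the variable index varies fastest
stacked = np.stack([x1, x2, x3, x4], axis=2)

# Default C-order reshape flattens (location, variable) pairs row by row
flat = stacked.reshape(len(x1), -1)

df = pd.DataFrame(flat, columns=header)
df.index.name = 'Time'
print(df)
```

Because the new axis is last, a plain C-order reshape already yields the location-major column order, so no `swaplevel` or Fortran-order reshape is needed.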

I've a data frame and one of its columns contains a list.
   A       B
0  5  [3, 4]
1  4  [1, 1]
2  1  [7, 7]
3  3  [0, 2]
4  5  [3, 3]
5  4  [2, 2]
The output should look like this:
A x y
0 5 3 4
1 4 1 1
2 1 7 7
3 3 0 2
4 5 3 3
5 4 2 2
I have tried the options that I found here, but they are not working.
df = pd.DataFrame(data={"A":[0,1],
                        "B":[[3,4],[1,1]]})
df['x'] = df['B'].apply(lambda x:x[0])
df['y'] = df['B'].apply(lambda x:x[1])
df = df.drop(columns='B')  # the list column is no longer needed
A x y
0 0 3 4
1 1 1 1
In case the list is stored as a string:
from ast import literal_eval
df = pd.DataFrame(data={"A":[0,1],
                        "B":["[3,4]","[1,1]"]})
df['x'] = df['B'].apply(lambda x:literal_eval(x)[0])
df['y'] = df['B'].apply(lambda x:literal_eval(x)[1])
A third way, credit to @anky_91:
df = pd.DataFrame(data={"A":[0,1],
                        "B":["[3,4]","[1,1]"]})
df["B"] = df["B"].apply(literal_eval)  # parse each string into a real list
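Once the column holds real lists, one way to expand it into separate columns in a single step is to build a new DataFrame from `tolist()`. This is a sketch under the assumption that every list has exactly two elements; the column name `B` and the variable `out` are chosen here for illustration, and the data is the question's sample.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [5, 4, 1, 3, 5, 4],
    "B": [[3, 4], [1, 1], [7, 7], [0, 2], [3, 3], [2, 2]],
})

# Expand each two-element list into its own row of a new frame,
# reusing df's index so the join lines up row by row
expanded = pd.DataFrame(df["B"].tolist(), index=df.index, columns=["x", "y"])

out = df.drop(columns="B").join(expanded)
print(out)
```

This avoids one `apply` call per target column, which can matter on larger frames.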

Set 3 level of column names in pandas DataFrame

I'm trying to have a frame with the following structure
h/a totales
sub1 sub2 sub1 sub2
a b ... f g ....m a b ... f g ....m
That is, two labels for the first level, two labels for the second, and then a third level of column names, where sub1 and sub2 do not share the same column names.
In order to do so I did the following:
names=['data level 1','data level 2','data level 3']])
What I get is this error:
>ValueError: Shape of passed values is (1, 21), indices imply (84, 21)
How can I fix this to have a multi leveled frame by column names?
Thank you
I think you need MultiIndex.from_tuples with tuples built by list comprehensions:
L1 = list('abc')
L2 = list('ghi')
tups = ([('h/a','means', x) for x in L1] +
[('h/a','percentage', x) for x in L2] +
[('totals','means', x) for x in L1] +
[('totals','percentage', x) for x in L2])
columnas=pd.MultiIndex.from_tuples(tups, names=['data level 1','data level 2','data level 3'])
print (columnas)
MultiIndex(levels=[['h/a', 'totals'],
['means', 'percentage'],
['a', 'b', 'c', 'g', 'h', 'i']],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]],
names=['data level 1', 'data level 2', 'data level 3'])
#some random data
data = np.random.randint(10, size=(3, 12))
print (data)
[[8 0 4 1 2 5 4 1 4 1 1 8]
[1 5 0 7 4 8 4 1 3 8 0 2]
[5 9 4 9 4 6 3 7 0 5 2 1]]
newframe = pd.DataFrame(data, columns=columnas)
print (newframe)
data level 1 h/a totals
data level 2 means percentage means percentage
data level 3 a b c g h i a b c g h i
0 8 0 4 1 2 5 4 1 4 1 1 8
1 1 5 0 7 4 8 4 1 3 8 0 2
2 5 9 4 9 4 6 3 7 0 5 2 1
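If there are many sub-levels, the four list comprehensions can be folded into one. The sketch below assumes a hypothetical dict, `sub_columns`, that maps each second-level label to its own third-level names; the labels are the ones used in the answer above.

```python
import numpy as np
import pandas as pd

top_labels = ["h/a", "totals"]
# Hypothetical mapping: each second-level label has its own column names
sub_columns = {"means": list("abc"), "percentage": list("ghi")}

# One (level 1, level 2, level 3) tuple per final column
tups = [(top, sub, col)
        for top in top_labels
        for sub, cols in sub_columns.items()
        for col in cols]

columns = pd.MultiIndex.from_tuples(
    tups, names=["data level 1", "data level 2", "data level 3"])

# Placeholder data with one column per tuple
frame = pd.DataFrame(np.zeros((3, len(columns)), dtype=int), columns=columns)
print(frame)
```

The data passed to the DataFrame constructor must have exactly `len(columns)` columns; a mismatch between the data's shape and the index length is the kind of condition that produces a "Shape of passed values" error.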

Why is stratifiedkfold generating the same splits in spite of using different random_state values?

I am trying to generate different stratified splits of my data set using StratifiedKFold and its random_state parameter. However, when I use different random_state values, I still get the same splits. My understanding is that different random_state values should produce different splits. Please let me know what I am doing incorrectly. Here is the code.
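import numpy as np
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5,random_state=0)
skf1 = StratifiedKFold(n_splits=5,random_state=100)
trains, cvs, trains1, cvs1 = [], [], [], []
for train, cv in skf.split(X_train, Y_train):
    trains.append(train)
    cvs.append(cv)
for train, cv in skf1.split(X_train, Y_train):
    trains1.append(train)
    cvs1.append(cv)
for c in list(range(0,5)):
    print(trains[c], trains1[c], sep="\n")
    print(cvs[c], cvs1[c], sep="\n")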
Here is the output
[2 3 4 5 6 7 8 9]
[2 3 4 5 6 7 8 9]
[0 1]
[0 1]
[0 1 4 5 6 7 8 9]
[0 1 4 5 6 7 8 9]
[2 3]
[2 3]
[0 1 2 3 6 7 8 9]
[0 1 2 3 6 7 8 9]
[4 5]
[4 5]
[0 1 2 3 4 5 8 9]
[0 1 2 3 4 5 8 9]
[6 7]
[6 7]
[0 1 2 3 4 5 6 7]
[0 1 2 3 4 5 6 7]
[8 9]
[8 9]
As stated in the documentation:
random_state : int, RandomState instance or None, optional, default=None
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when shuffle == True.
So simply add shuffle=True to your StratifiedKFold calls. For example:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
skf1 = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
[0 1 3 4 5 6 7 9]
[0 1 2 3 4 5 8 9]
[2 8]
[6 7]
[0 1 2 3 5 6 7 8]
[0 2 3 4 6 7 8 9]
[4 9]
[1 5]
[0 2 3 4 5 7 8 9]
[0 1 3 5 6 7 8 9]
[1 6]
[2 4]
[0 1 2 4 5 6 8 9]
[1 2 4 5 6 7 8 9]
[3 7]
[0 3]
[1 2 3 4 6 7 8 9]
[0 1 2 3 4 5 6 7]
[0 5]
[8 9]
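A small self-contained sketch of the same point, using hypothetical toy data (`X`, `y`) rather than the asker's `X_train` and `Y_train`. With `shuffle=True`, the seed controls the fold assignment, so the same seed reproduces the folds and a different seed will normally change them. Note that recent scikit-learn releases raise a ValueError when `random_state` is set while `shuffle` is left at False, which makes this mistake easier to spot.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 10 samples, two balanced classes
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

def fold_test_indices(seed):
    """Return the test indices of each fold for a given seed."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return [tuple(test) for _, test in skf.split(X, y)]

folds_a = fold_test_indices(0)
folds_b = fold_test_indices(0)
folds_c = fold_test_indices(100)

print(folds_a == folds_b)  # same seed: the folds are reproduced
print(folds_a == folds_c)  # different seed: the folds normally differ
```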
