Use list of index's to create a subset df - python-3.x

So i have a dataframe which ive selected certain values from :
x=df[df['column'].str.contains('foo')].index
if i then want to make a new df with the selected indexs from the original df by:
df2=df[x],
the following message pops up:
KeyError: "Int64Index([ 48, 64, 98, 118, 120, 128, 138, 144, 151,\n 166,\n ...\n 15892, 15893, 15894, 15895, 15896, 15897, 15898, 15899, 15900,\n 15901],\n dtype='int64', length=4711) not in index"
those indexs are in the dataframe as df.iloc[48] returns a value
Anyone got any ideas?

I believe you need loc - select by index values:
x=df.index[df['column'].str.contains('foo')]
df2=df.loc[x]
#if default monotonic index - 0,1,..., len(df) - 1
#df2=df.iloc[x]
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
x=df.index[df['F'].str.contains('b')]
print (x)
Int64Index([3, 4, 5], dtype='int64')
df2=df.loc[x]
print (df2)
A B C D E F
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
Simplier is use only:
df2=df[df['F'].str.contains('b')]
print (df2)
A B C D E F
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b

Related

pandas fill 0s with mean based on rows that match a condition in another column

I have a dataframe like below in which I need to replace the 0s with the mean of the rows where the parent_key matches the self_key.
Input DataFrame: df= pd.DataFrame ({'self_key':['a','b','c','d','e','e','e','f','f','f'],'parent_key':[np.nan,'a','b','b','c','c','c','d','d','d'], 'value':[0,0,0,0,4,6,14,12,8,22],'level':[1,2,3,3,4,4,4,4,4,4]})
The row 3 has self_key of 'd' so I would need to replace its 0 value in column 'value' with the mean of rows 7,8,9 to fill with the correct value of 14. Since the lower levels feed into the higher levels I would need to do it from lowest level to highest to fill out the dataframe as well but when I do the below code it doesn't work and I get the error "ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional". How can I fill in the 0s with the means from lowest level to highest?
df['value']=np.where((df['value']==0) & (df['level']==3), df['value'].groupby(df.where(df['parent_key']==df['self_key'])).transform('mean'), df['value'])
Input
self_key parent_key value level
0 a NaN 0 1
1 b a 0 2
2 c b 0 3
3 d b 0 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4
My approach is to repeat the above code 3 times and change the level from 3 to 2 to 1, but its not working for even level 3.
Expected Ouput:
self_key parent_key value level
0 a NaN 11 1
1 b a 11 2
2 c b 8 3
3 d b 14 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4
If I understand your problem correctly, you are trying to compute mean in a bottom-up fashion by filtering dataframe on certain keys. If so, then following should solve it:
for l in range(df["level"].max()-1, 0, -1):
df_sub = df[(df["level"] == l) & (df["value"] == 0)]
self_keys = df_sub["self_key"].tolist()
for k in self_keys:
df.loc[df_sub[df_sub["self_key"] == k].index, "value"] = df[df["parent_key"] == k]["value"].mean()
[Out]:
self_key parent_key value level
0 a 11 1
1 b a 11 2
2 c b 8 3
3 d b 14 3
4 e c 4 4
5 e c 6 4
6 e c 14 4
7 f d 12 4
8 f d 8 4
9 f d 22 4

Create new column with a list of max frequency values for each row of a pandas dataframe

Given this Dataframe:
df2 = pd.DataFrame([[3,3,3,3,3,3,5,5,5,5],[2,2,2,2,8,8,8,8,6,6]], columns=list('ABCDEFGHIJ'))
A B C D E F G H I J
0 3 3 3 3 3 3 5 5 5 5
1 2 2 2 2 8 8 8 8 6 6
I created 2 news columns which give for each row the max_freq and the max_freq_value:
df2["max_freq_val"] = df2.apply(lambda x: x.mode().agg(list), axis=1)
df2["max_freq"] = df2.loc[:, df2.columns != "max_freq_val"].apply(lambda x: x.value_counts().max(), axis=1)
A B C D E F G H I J max_freq_val max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
EDIT: I've edited my code inspired by the answer given by #rhug123.
Thanks to all of you for your answers.
Try this, it uses mode()
df2.assign(max_freq=pd.Series(df2.mode(axis=1).stack().groupby(level=0).agg(list)),
max_freq_value = df2.eq(df2.mode(axis=1)[0].squeeze(),axis=0).sum(axis=1))
or
df2.assign(freq = df2.eq((s := df2.mode(axis=1).stack().groupby(level=0).agg(list)).str[0],axis=0).sum(axis=1),val = s)
We can try stack then adjust the freq with agg put the multiple into the list
s = df2.stack().groupby(level=0).value_counts()
s = s[s.eq(s.max(level=0),level=0)].reset_index(level=1).groupby(level=0).agg(val= ('level_1',list),fre=(0,'first'))
df2 = df2.join(s)
df2
Out[156]:
A B C D E F G H I J val fre
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Perhaps you could use this function:
def give_back_maximums(a = [2,2,2,2,8,8,8,8,6,6]):
values, counts = np.unique(a, return_counts=True)
return values[counts >= counts.max()].tolist()
The order of the below could affect the result
df2["max_freq_value"] = df2.apply(lambda x: give_back_maximums(x), axis=1)
df2["max_freq"] = df2.apply(lambda x: x.value_counts().max(), axis=1)
print(df2)
A B C D E F G H I J max_freq_value max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Hope it helps : )

Get value from another dataframe column based on condition

I have a dataframe like below:
>>> df1
a b
0 [1, 2, 3] 10
1 [4, 5, 6] 20
2 [7, 8] 30
and another like:
>>> df2
a
0 1
1 2
2 3
3 4
4 5
I need to create column 'c' in df2 from column 'b' of df1 if column 'a' value of df2 is in coulmn 'a' df1. In df1 each tuple of column 'a' is a list.
I have tried to implement from following url, but got nothing so far:
https://medium.com/#Imaadmkhan1/using-pandas-to-create-a-conditional-column-by-selecting-multiple-columns-in-two-different-b50886fabb7d
expect result is
>>> df2
a c
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
Use Series.map by flattening values from df1 to dictionary:
d = {c: b for a, b in zip(df1['a'], df1['b']) for c in a}
print (d)
{1: 10, 2: 10, 3: 10, 4: 20, 5: 20, 6: 20, 7: 30, 8: 30}
df2['new'] = df2['a'].map(d)
print (df2)
a new
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
EDIT: I think problem is mixed integers in list in column a, solution is use if/else for test it for new dictionary:
d = {}
for a, b in zip(df1['a'], df1['b']):
if isinstance(a, list):
for c in a:
d[c] = b
else:
d[a] = b
df2['new'] = df2['a'].map(d)
Use :
m=pd.DataFrame({'a':np.concatenate(df.a.values),'b':df.b.repeat(df.a.str.len())})
df2.merge(m,on='a')
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
First we unnest the list df1 to rows, then we merge them on column a:
df1 = df1.set_index('b').a.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'a'})
print(df1, '\n')
df_final = df2.merge(df1, on='a')
print(df_final)
b a
0 10 1.0
1 10 2.0
2 10 3.0
0 20 4.0
1 20 5.0
2 20 6.0
0 30 7.0
1 30 8.0
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20

Deleting the first instance in a data frame

I was wondering whats the best way to delete the first instance of a particular index in a Pandas dataframe?
In the example below, I want to delete row 0,5 and 9
Use boolean indexing with Index.duplicated:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}, index=[0,0,1,2,2,2])
print (df)
A B C D E F
0 a 4 7 1 5 a
0 b 5 8 3 3 a
1 c 4 9 5 6 a
2 d 5 4 7 9 b
2 e 5 2 1 2 b
2 f 4 3 0 4 b
df = df[df.index.duplicated()]
print (df)
A B C D E F
0 b 5 8 3 3 a
2 e 5 2 1 2 b
2 f 4 3 0 4 b
Detail:
print (df.index.duplicated())
[False True False False True True]
Heres a way to do it using groupby:
rst = df.reset_index()
df['int_index'] = df.reset_index().index
firsts = df.groupby(df.index).first()
filt = df[~df['int_index'].isin(firsts['int_index'])]
missing = df[df.index.value_counts() == 1]
res = pd.concat([drp, missing]).sort_index().drop('int_index', axis=1)

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8,b,A], [8, c, A], [6,i,A], [5, h, A], [4,g,A]].
consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I'm going to use np.argpartition to separate each column into the 5 smallest and 10 - 5 (also 5) largest
v = df.values
i = df.index.values
k = len(v) - 5
pd.DataFrame(
i[v.argpartition(k, 0)[-k:]],
np.arange(k), df.columns
)
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
print(your_dataframe.sort_values(ascending=False)[0:4])

Resources