How to sort a pandas dataframe by the standard deviations of its columns? - python-3.x

Given the following example:
import pandas as pd
from statistics import stdev

d = pd.DataFrame({"a": [45, 55], "b": [5, 95], "c": [30, 70]})
stds = [stdev(d[c]) for c in d.columns]
With output:
In [87]: d
Out[87]:
a b c
0 45 5 30
1 55 95 70
In [91]: stds
Out[91]: [7.0710678118654755, 63.63961030678928, 28.284271247461902]
I would like to be able to sort the columns of the dataframe by their
standard deviations, resulting in the following:
b c a
0 5 30 45
1 95 70 55

You are looking for:
d.iloc[:,(-d.std()).argsort()]
Out[8]:
b c a
0 5 30 45
1 95 70 55
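Note that d.std() defaults to the sample standard deviation (ddof=1), the same as statistics.stdev, so this ordering matches the stds list above. For ascending order, simply drop the minus sign:
d.iloc[:, d.std().argsort()]
#    a   c   b
# 0  45  30   5
# 1  55  70  95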

You can get the column order like this:
column_order = d.std().sort_values(ascending=False).index
# >>> column_order
# Index(['b', 'c', 'a'], dtype='object')
And then sort the columns like this:
d[column_order]
b c a
0 5 30 45
1 95 70 55
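If you are on pandas 1.1 or newer, a minimal sketch of a third variant: sort the columns directly with the key argument of sort_index (the key receives the column Index and here returns the negated standard deviations, so columns come out in descending order of std):
d.sort_index(axis=1, key=lambda c: -d[c].std().to_numpy())
#    b   c   a
# 0   5  30  45
# 1  95  70  55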

Related

Pandas DataFrame: Same operation on multiple sets of columns

I want to do the same operation on multiple sets of columns of a DataFrame.
Since "for-loops" are frowned upon I'm searching for a decent alternative.
An example:
import pandas as pd

df = pd.DataFrame({
    'a': [1, 11, 111],
    'b': [222, 22, 2],
    'a_x': [10, 80, 30],
    'b_x': [20, 20, 60],
})
This is a simple for-loop approach. It's short and quite readable.
cols = ['a', 'b']
for col in cols:
    df[f'{col}_res'] = df[[col, f'{col}_x']].min(axis=1)
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
This is an alternative (w/o for-loop), but I feel that the additional complexity is not really for the better.
cols = ['a', 'b']

def res_df(df, col, name):
    res = pd.Series(
        df[[col, f'{col}_x']].min(axis=1), index=df.index, name=name)
    return res

res = [res_df(df, col, f'{col}_res') for col in cols]
df = pd.concat([df, pd.concat(res, axis=1)], axis=1)
Does anyone have a better/more pythonic solution?
Thanks!
UPDATE 1
Inspired by the proposal from mozway I find the following solution quite appealing.
Imho it's short, readable and generic, since the particular operation can be swapped into a function and the list comprehension applies the function to the given sets of columns.
import numpy as np

def operation(s1, s2):
    # fill in any operation on two pandas Series,
    # e.g. res = s1 * s2 / (s1 + s2)
    res = np.minimum(s1, s2)
    return res

df = df.join(
    [operation(df[f'{col}'], df[f'{col}_x']).rename(f'{col}_res') for col in cols]
)
You can use numpy.minimum after setting the arrays to identical column names:
cols = ['a', 'b']
cols2 = [f'{x}_x' for x in cols]

df = df.join(np.minimum(df[cols],
                        df[cols2].set_axis(cols, axis=1))
             .add_suffix('_res'))
output:
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
or, using rename as suggested in the other answer:
cols = ['a', 'b']
cols2 = {f'{x}_x': x for x in cols}

df = df.join(np.minimum(df[cols],
                        df[list(cols2)].rename(columns=cols2))
             .add_suffix('_res'))
One idea is to rename the column names with a dictionary, select the columns from the list cols, and then group by the column names, aggregating with min, sum or max, or a custom function:
cols = ['a', 'b']
suffix = '_x'
d = {f'{x}{suffix}':x for x in cols}
print (d)
{'a_x': 'a', 'b_x': 'b'}
print (df.rename(columns=d)[cols])
a a b b
0 1 10 222 20
1 11 80 22 20
2 111 30 2 60
df1 = df.rename(columns=d)[cols].groupby(axis=1,level=0).min().add_suffix('_res')
print (df1)
a_res b_res
0 1 20
1 11 20
2 30 2
Last, add the result to the original DataFrame:
df = df.join(df1)
print (df)
a b a_x b_x a_res b_res
0 1 222 10 20 1 20
1 11 22 80 20 11 20
2 111 2 30 60 30 2
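Another hedged variant of the same idea, in case it is useful: DataFrame.assign with a dict comprehension keeps everything in one expression (here np.minimum simply stands in for whatever operation is actually needed):
import numpy as np

cols = ['a', 'b']
df = df.assign(**{f'{c}_res': np.minimum(df[c], df[f'{c}_x']) for c in cols})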

Get the names of Top 'n' columns based on a threshold for values across a row

Let's say, that I have the following data:
In [1]: df
Out[1]:
Student_Name Maths Physics Chemistry Biology English
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
I want to add a column to this dataframe which tells me the students' top 'n' subjects that are above a threshold, where the subject names are available in the column names. Let's assume n=3 and threshold=80.
The output would look like the following:
In [3]: df
Out[3]:
Student_Name Maths Physics Chemistry Biology English Top_3_above_80
0 John Doe 90 87 81 65 70 Maths, Physics, Chemistry
1 Jane Doe 82 84 75 73 77 Physics, Maths
2 Mary Lim 40 65 55 60 70 nan
3 Lisa Ray 55 52 77 62 90 English
I tried to use the solution written by @jezrael for this question, where they use numpy.argsort to get the positions of sorted values for the top 'n' columns, but I am unable to set a threshold value below which nothing should be considered.
The idea is to first replace values that do not pass the threshold with missing values using DataFrame.where, then apply the linked numpy.argsort solution. Count the matches per row (the number of True values in the mask) and use numpy.where to replace every position beyond that count with an empty string.
Last, the values are joined in a list comprehension and the rows with no match are set back to missing values:
df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
arr = df1.columns.values[np.argsort(-df1.where(m), axis=1)]
m = np.arange(arr.shape[1]) < count[:, None]
a = np.where(m, arr, '')
L = [', '.join(x).strip(', ') for x in a]
df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
print (df)
Student_Name Maths Physics Chemistry Biology English \
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
Top_3_above_80
0 Maths, Physics, Chemistry
1 Physics, Maths
2 NaN
3 English
If performance is not important, use Series.nlargest per row, but it is really slow on a large DataFrame:
df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
df['Top_3_above_80'] = (df1.where(m)
.apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
print (df)
Student_Name Maths Physics Chemistry Biology English \
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
Top_3_above_80
0 Maths, Physics, Chemistry
1 Physics, Maths
2 NaN
3 English
Performance:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
#print (df)
def f1(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    arr = df1.columns.values[np.argsort(-df1.where(m), axis=1)]
    m = np.arange(arr.shape[1]) < count[:, None]
    a = np.where(m, arr, '')
    L = [', '.join(x).strip(', ') for x in a]
    df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
    return df

def f2(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    df['Top_3_above_80'] = (df1.where(m).apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
    return df
In [210]: %timeit (f1(df.copy()))
19.3 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [211]: %timeit (f2(df.copy()))
2.43 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
An alternative:
res = []
tmp = df.set_index('Student_Name').T
for col in list(tmp):
    res.append(tmp[col].nlargest(3)[tmp[col].nlargest(3) > 80].index.tolist())
res = [x if len(x) > 0 else np.NaN for x in res]
df['Top_3_above_80'] = res
Output:
Student_Name Maths Physics Chemistry Biology English Top_3_above_80
0 JohnDoe 90 87 81 65 70 [Maths, Physics, Chemistry]
1 JaneDoe 82 84 75 73 77 [Physics, Maths]
2 MaryLim 40 65 55 60 70 NaN
3 LisaRay 55 52 77 62 90 [English]
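For what it's worth, a minimal self-contained sketch of the same row-wise idea wrapped in a named helper; like any apply over rows, it will be slow on large frames (the helper name top_n and the subjects variable are just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Student_Name': ['John Doe', 'Jane Doe', 'Mary Lim', 'Lisa Ray'],
    'Maths': [90, 82, 40, 55], 'Physics': [87, 84, 65, 52],
    'Chemistry': [81, 75, 55, 77], 'Biology': [65, 73, 60, 62],
    'English': [70, 77, 70, 90],
})
subjects = df.columns[1:]

def top_n(row, n=3, threshold=80):
    # keep only marks above the threshold, then take the n largest
    s = row[row > threshold].nlargest(n)
    return ', '.join(s.index) if len(s) else np.nan

df['Top_3_above_80'] = df[subjects].apply(top_n, axis=1)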

Performing pair-wise comparisons of some pandas dataframe rows as efficiently as possible

For a given pandas dataframe df, I would like to compare every sample (row) with each other.
For bigger datasets this would lead to too many comparisons (n**2). Therefore, it is necessary to perform these comparisons only for smaller groups (i.e. for all of those which share the same id) and as efficiently as possible.
I would like to construct a dataframe (df_pairs), which contains in every row one pair. Additionally, I would like to get all pair indices (ideally as a Python set).
First, I construct an example dataframe:
import numpy as np
import pandas as pd
from functools import reduce
from itertools import product, combinations
n_samples = 10_000
suffixes = ["_1", "_2"] # for df_pairs
id_str = "id"
df = pd.DataFrame({id_str: np.random.randint(0, 10, n_samples),
"A": np.random.randint(0, 100, n_samples),
"B": np.random.randint(0, 100, n_samples),
"C": np.random.randint(0, 100, n_samples)}, index=range(0, n_samples))
columns_df_pairs = ([elem + suffixes[0] for elem in df.columns] +
[elem + suffixes[1] for elem in df.columns])
In the following, I am comparing 4 different options with the corresponding performance measures:
Option 1
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [set(product(elem.tolist(), repeat=2)) for _, elem in groups.items()] # determine pairs per group
set_of_pairs = reduce(set.union, pairs_per_group) # convert all groups into one set
idcs1, idcs2 = zip(*[(e1, e2) for e1, e2 in set_of_pairs])
df_pairs = pd.DataFrame(np.hstack([df.values[idcs1, :], df.values[idcs2, :]]), # construct the dataframe of pairs
columns=columns_df_pairs,
index=pd.MultiIndex.from_tuples(set_of_pairs, names=('index 1', 'index 2')))
df_pairs.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 1 takes 34.2 s ± 1.28 s.
Option 2
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array(np.meshgrid(elem.values, elem.values)).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs2 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]), # construct the dataframe of pairs
columns=columns_df_pairs,
index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs2.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 2 takes 13 s ± 1.34 s.
Option 3
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array([np.tile(elem.values, len(elem.values)), np.repeat(elem.values, len(elem.values))]).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs3 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]), # construct the dataframe of pairs
columns=columns_df_pairs,
index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs3.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 3 takes 12.1 s ± 347 ms.
Option 4
df_pairs4 = pd.merge(left=df, right=df, how="inner", on=id_str, suffixes=suffixes)
# here, I do not know how to get the MultiIndex in
df_pairs4.drop([id_str], inplace=True, axis=1)
Option 4 is computed the quickest with 1.41 s ± 239 ms. However, I do not have the paired indices in this case.
I could improve the performance a little bit by using comparisons instead of product of itertools. I could also build the comparison matrix and use only the upper triangular one and construct my dataframe from there. This however does not seem to be more efficient than performing the cartesian product and removing the self-references as well as inverse comparisons (a, b) = (b, a).
Could you tell me a more efficient way to get pairs for comparison (ideally as a set to be able to use set operations)?
Could I use merge or another pandas function to construct my desired dataframe with the multi-indices?
An inner merge will destroy the index in favor of a new Int64Index. If the index is important, bring it along as a column with reset_index, then set those columns back as the index.
df_pairs4 = (pd.merge(left=df.reset_index(), right=df.reset_index(),
                      how="inner", on=id_str, suffixes=suffixes)
               .set_index(['index_1', 'index_2']))
id A_1 B_1 C_1 A_2 B_2 C_2
index_1 index_2
0 0 4 92 79 10 92 79 10
13 4 92 79 10 83 68 69
24 4 92 79 10 67 73 90
25 4 92 79 10 22 31 35
36 4 92 79 10 64 44 20
... .. ... ... ... ... ... ...
9993 9971 7 20 65 92 47 65 21
9977 7 20 65 92 50 35 27
9980 7 20 65 92 43 36 62
9992 7 20 65 92 99 2 17
9993 7 20 65 92 20 65 92
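A hedged follow-up for the rest of the question: once the merged frame carries the ('index_1', 'index_2') MultiIndex, the self-pairs and the mirrored (b, a) duplicates can be dropped by comparing the two index levels, and the remaining index gives the requested set of pair indices (variable names here are just illustrative):
lvl1 = df_pairs4.index.get_level_values('index_1')
lvl2 = df_pairs4.index.get_level_values('index_2')
df_pairs_unique = df_pairs4[lvl1 < lvl2]     # drops (i, i) and keeps one of (a, b) / (b, a)
pair_index_set = set(df_pairs_unique.index)  # pair indices as a Python set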

Remove index from dataframe using Python

I am trying to create a Pandas Dataframe from a string using the following code -
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
I am getting the following result -
0 1 2
0 A B C
1 0 34 88
2 2 45 200
3 3 47 65
4 4 32 140
5 None None
But I need something like the following -
A B C
0 34 88
2 45 200
3 47 65
4 32 140
I added "index = False" while creating the dataframe like -
df = pd.DataFrame([x.split(';') for x in data.split('\n')],index = False)
But, it gives me an error -
TypeError: Index(...) must be called with a collection of some kind, False
was passed
How is this achievable?
Use read_csv with StringIO and the index_col parameter to set the first column as the index:
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
df = pd.read_csv(pd.compat.StringIO(input_string),sep=';', index_col=0)
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
Your solution should be changed to split with the default separator (arbitrary whitespace, which also drops the trailing empty line), pass all of the lists except the first to DataFrame together with the first list as the columns parameter, and, if you need the first column as the index, add DataFrame.set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index('A')
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
For a general solution, use the first value of the first list in set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index(L[0][0])
EDIT:
You can set A as the columns' name instead of the index name:
df = df.rename_axis(df.index.name, axis=1).rename_axis(None)
print (df)
A B C
0 34 88
2 45 200
3 47 65
4 32 140
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split()])
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
df.set_index('A',inplace = True)
df
output
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
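If the goal is mainly presentation (keep A as a normal column and just not show the default RangeIndex), a small hedged sketch:
from io import StringIO
import pandas as pd

input_string = "A;B;C\n0;34;88\n2;45;200\n3;47;65\n4;32;140\n"
df = pd.read_csv(StringIO(input_string), sep=';')
print(df.to_string(index=False))
#  A   B    C
#  0  34   88
#  2  45  200
#  3  47   65
#  4  32  140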

Pandas multi-index subtract from value based on value in other column

Given the following data frame:
df = pd.DataFrame({
    ('A', 'a'): [23, 'n/a', 54, 7, 32, 76],
    ('B', 'b'): [23, 'n/a', 54, 'n/a', 32, 76],
    ('possible', 'possible'): [100, 100, 100, 100, 100, 100]
})
df
A B possible
a b
0 23 23 100
1 n/a n/a 100
2 54 54 100
3 7 n/a 100
4 32 32 100
5 76 76 100
I'd like to adjust 'possible', per row, for every instance of 'n/a' such that each instance will subtract 4 from 'possible'.
The desired result is as follows:
A B possible
a b possible
0 23 23 100
1 n/a n/a 92
2 54 54 100
3 7 n/a 96
4 32 32 100
5 76 76 100
Then when that's done, I want every instance of 'n/a' to be converted to 0 so that the column type is integer (but float will do).
Thanks in advance!
Follow-up question:
What if my multi-index is like this:
df = pd.DataFrame({
('A', 'a'): [23, 'n/a',54,7,32,76],
('A', 'b'): [23, 'n/a',54,7,32,76],
('B', 'b'): [23, 'n/a',54,7,32,76],
('possible','possible'):[100,100,100,100,100,100]
})
I have 5 upper level indices and 25 lower level ones. I'm wondering if it's possible to only refer to the top ones in
no4 = (df.loc[:, (top level indices),(bottom level indices)] == 'n/a').sum(axis=1)
I think you can check the values with a mask and boolean indexing, and at the end replace all 'n/a' values with 0.
First, find the values equal to 'n/a' and sum them per row:
idx = pd.IndexSlice
no4 = (df.loc[:, idx[('A', 'B'), ('a', 'b')]] == 'n/a').sum(axis=1)
print(no4)
0 0
1 2
2 0
3 1
4 0
5 0
dtype: int64
Check where the sum is not equal to 0 (which means the row contains 'n/a' values):
mask = no4 != 0
print(mask)
0 False
1 True
2 False
3 True
4 False
5 False
dtype: bool
Subtract 4 times no4:
df.loc[mask, idx['possible', 'possible']] -= no4 * 4
df.replace({'n/a': 0}, inplace=True)
print(df)
A B possible
a b possible
0 23 23 100.0
1 0 0 92.0
2 54 54 100.0
3 7 0 96.0
4 32 32 100.0
5 76 76 100.0
EDIT:
I found a simpler solution - the mask is not necessary, because you subtract 0 where there is no 'n/a':
idx = pd.IndexSlice
print((df.loc[:, idx[('A', 'B'), ('a', 'b')]] == 'n/a').sum(axis=1) * 4)
0 0
1 8
2 0
3 4
4 0
5 0
dtype: int64
df.loc[:, idx['possible', 'possible']] -= (
    (df.loc[:, idx[('A', 'B'), ('a', 'b')]] == 'n/a').sum(axis=1) * 4
)
df.replace({'n/a':0}, inplace=True)
print(df)
A B possible
a b possible
0 23 23 100
1 0 0 92
2 54 54 100
3 7 0 96
4 32 32 100
5 76 76 100
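For the dtype part of the question (ending up with integer columns): the replace alone may leave the A and B columns as object dtype because they previously held strings, so an explicit cast afterwards is a reasonable extra step (a hedged suggestion, not part of the original answer):
df = df.astype(int)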
EDIT1: If you need to select only the top-level indices, see using slicers:
(df.loc[:, idx[(top level indices),:]] == 'n/a').sum(axis=1)
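To make that concrete with the follow-up frame from the question, the chosen top-level labels go in a list and the bottom level is left as a full slice; a minimal sketch, assuming 'A' and 'B' stand in for the real top-level labels:
import pandas as pd

idx = pd.IndexSlice
df = pd.DataFrame({
    ('A', 'a'): [23, 'n/a', 54, 7, 32, 76],
    ('A', 'b'): [23, 'n/a', 54, 7, 32, 76],
    ('B', 'b'): [23, 'n/a', 54, 7, 32, 76],
    ('possible', 'possible'): [100, 100, 100, 100, 100, 100],
})
# every column under the chosen top-level labels, whatever the bottom level is
counts = (df.loc[:, idx[['A', 'B'], :]] == 'n/a').sum(axis=1)
df.loc[:, idx['possible', 'possible']] -= counts * 4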
