How do I get nlargest rows without the sorting? - python-3.x

I need to extract the n-smallest rows of a pandas df, but it is very important to me to maintain the original order of rows.
code example:
import pandas as pd
df = pd.DataFrame({
'a': [1, 10, 8, 11, -1],
'b': list('abdce'),
'c': [1.0, 2.0, 1.5, 3.0, 4.0]})
df.nsmallest(3, 'a')
Gives:
a b c
4 -1 e 4.0
0 1 a 1.0
2 8 d 1.5
I need:
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0
Any ideas how to do that?
PS! In my real example, the index is not sorted/sortable as they are strings (names).

Simplest approach assuming index was sorted in the beginning
df.nsmallest(3, 'a').sort_index()
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0
Alternatively with np.argpartition and iloc
This doesn't depend on sorting the index.emphasized text
df.iloc[np.sort(df.a.values.argpartition(3)[:3])]
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0

Related

Pairwise operations in Scikit-Learn and different filtering conditions on each pair

I have the following 2 data frames, say df1
a b c d
0 0 1 2 3
1 4 0 0 7
2 8 9 10 11
3 0 0 0 15
and df2
a b c d
0 5 1 2 3
What I am interested in doing is a pairwise operation on each row in df1 with the single row in df2. However, if a column in a row of df1 is 0, then that column is used in neither the df1 row nor df2 row to perform the pairwise operation. So each pairwise operation will work on pairs of rows of different length. Let me break it down how the 4 comparison should be.
Comparison 1
0 1 2 3 vs 5 1 2 3
The pairwise operation is done on 1 2 3 vs 1 2 3 as column a has a 0
Comparison 2
4 0 0 7 vs 5 1 2 3 is done on 4 7 vs 5 3 as we have 2 columns that need to be dropped
Comparison 3
8 9 10 11 vs 5 1 2 3 is done on 8 9 10 11 vs 5 1 2 3 as no columns are dropped
Comparison 4
0 0 0 15 vs 5 1 2 3 is done on 15 vs 3 as all but one column is dropped
The result of each pairwise operation is a scalar so the result is some sort of structure whether it be list, array, data frame, whatever with 4 (or the number of rows in df1) values. Also, I should note that values in df2 are irrelevant and no filtering is done based upon the value of any column in df2.
For simplicity, you could try looping over each row in the dataframe and do something like this:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# loop over each row in 'a'
for i in range(len(a)):
# find indicies of non-zero elements of the row
non_zero = np.nonzero(a.iloc[i].to_numpy())[0]
# perform pair-wise addition between non-zero elements in 'a' and the same elements in 'b'
print(np.array(a.iloc[i])[(non_zero)] + np.array(b.iloc[0])[(non_zero)])
Here I used pair-wise addition but you could replace the addition with an operation of your choosing.
Edit:
We may want to vectorize this to avoid the loop if the dataframes are large. Here is an idea for that, where we convert zero values to nan so they are ignored in the row-wise operation:
import pandas as pd
import numpy as np
a = pd.DataFrame(data=[[0,1,2,3],[4,0,0,7],[8,9,10,11],[0,0,0,15]], columns=['a', 'b', 'c', 'd'])
b = pd.DataFrame(data=[[5, 1, 2, 3]], columns=['a', 'b', 'c', 'd'])
# find indicies of zeros
zeros = (a==0).values
# set zeros to nan
a[zeros] = np.nan
# tile and reshape 'b' so its the same shape as 'a'
b = pd.DataFrame(np.tile(b, len(a)).reshape(np.shape(a)), columns=b.columns)
# set the zero indices to nan
b[zeros] = np.nan
print('a:')
print(a)
print('b:')
print(b)
# now do some row-wise operation. For example take the sum of each row
print(np.sum(a+b, axis=1))
Output:
a:
a b c d
0 NaN 1.0 2.0 3
1 4.0 NaN NaN 7
2 8.0 9.0 10.0 11
3 NaN NaN NaN 15
b:
a b c d
0 NaN 1.0 2.0 3
1 5.0 NaN NaN 3
2 5.0 1.0 2.0 3
3 NaN NaN NaN 3
sum:
0 12.0
1 19.0
2 49.0
3 18.0
dtype: float64

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of pd.dropna but I do not manage to find the way to specify the subset of columns. I have tried using pd.IndexSlice but this does not work.
In the example below I need to get ride of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
[1, 2, 3, 4, 5, 6],
[None, None, 1, 2, 3, 4],
[None, 1, 2, 3, 4, 5],
[None, None, 5, 3, 3, 2],
[None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative but if possible I would like to use subset and how='all'
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
If you want to apply the dropna to the columns which have 1 or 2 in level 1, you can do it as follows:
cols= [(c0, c1) for (c0, c1) in df.columns if c0 in [1,2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last line (index=4) is gone, because all columns below 1 and 2 were NaN for this line. If you rather want all rows to be removed, where any NaN occured in the column, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6

How to replace selected rows of pandas dataframe with a np array, sequentially?

I have a pandas dataframe
A B C
0 NaN 2 6
1 3.0 4 0
2 NaN 0 4
3 NaN 1 2
where I have a column A that has NaN values in some rows (not necessarily consecutive).
I want to replace these values not with a constant value (which pd.fillna does), but rather with the values from a numpy array.
So the desired outcome is:
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
I'm not sure the .replace method will help here as well, since that seems to replace value <-> value via dictionary. Whereas here I want to sequentially change NaN to its corresponding value (by index) in the np array.
I tried:
MWE:
huh = pd.DataFrame([[np.nan, 2, 6],
[3, 4, 0],
[np.nan, 0, 4],
[np.nan, 1, 2]],
columns=list('ABC'))
huh.A[huh.A.isnull()] = np.array([1,5,7]) # what i want to do, but this gives error
gives the error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
'''
I read the docs but I can't understand how to do this with .loc.
How do I do this properly, preferably without a for loop?
Other info:
The number of elements in the np array will always match the number of NaN in the dataframe, so your answer does not need to check for this.
You are really close, need DataFrame.loc for avoid chained assignments:
huh.loc[huh.A.isnull(), 'A'] = np.array([1,5,7])
print (huh)
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
zip
This should account for uneven lengths
m = huh.A.isna()
a = np.array([1, 5, 7])
s = pd.Series(dict(zip(huh.index[m], a)))
huh.fillna({'A': s})
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2

How to trim and reshape dataframe?

I have df that looks like this:
a b c d e f
1 na 2 3 4 5
1 na 2 3 4 5
1 na 2 3 4 5
1 6 2 3 4 5
How do I trim and reshape the dataframe so that for every column the n/a are dropped and the dataframe looks like this:
Edit;
df.dropna() is dropping all the rows.
a b c d e f
1 6 2 3 4 5
This dataframe has millions of rows, I need to be able to drop the n/a rows by column while retaining rows and columns with data in them.
edit;
df.dropna() is dropping all the rows in the column. When I check if the columns with n/a are empty, df.column_name.empty() I get false. So there is data in columns with n/a
For me dropna working nice for remove missing values and Nones:
df = df.dropna()
print (df)
a b c d e f
3 1 6.0 2 3 4 5
But if possible multiple values for removing create mask by isin, chain testing missing values with isnull and last filter by any - return at least one True per row by inverted mask ~:
df = pd.DataFrame({'a': ['a', None, 's', 'd'],
'b': ['na',7, 2, 6],
'c': [2, 2, 2, 2],
'd': [3, 3, 3, 3],
'e': [4, 4, np.nan, 4],
'f': [5, 5, 5, 5]})
print (df)
a b c d e f
0 a na 2 3 4.0 5
1 None 7 2 3 4.0 5
2 s 2 2 3 NaN 5
3 d 6 2 3 4.0 5
df1 = df.dropna()
print (df1)
a b c d e f
0 a na 2 3 4.0 5
3 d 6 2 3 4.0 5
mask = (df.isin(['na', 'n/a']) | df.isnull()).any(axis=1)
df2 = df[~mask]
print (df2)
a b c d e f
3 d 6 2 3 4.0 5

How to make pandas .mean(0) .std(0) etc. into a list?

I have a df and some of the columns contains numbers and I calculate mean, std, median etc on these columns using df.mean(0)..
How can I put these summary statistics in a list?? One list for mean, one for median etc..
I think you can use Series.tolist, because output of your functions is Series:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#sum(0), std(0) is same as sum(), std() because 0 is by default
L1 = df.sum().tolist()
L2 = df.std().tolist()
print (L1)
print (L2)
[6, 15, 24, 9, 14, 14]
[1.0, 1.0, 1.0, 2.0, 1.5275252316519465, 2.0816659994661326]

Resources