Conditional column selection in pandas - python-3.x

I want to select columns from a DataFrame according to a particular condition. I know it can be done with a loop, but my df is very large so efficiency is crucial. The condition for column selection is having either only non-nan entries or a sequence of only nans followed by a sequence of only non-nan entries.
Here is an example. Consider the following DataFrame:
pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan], [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])
0 1 2 3
0 1 NaN 2.0 NaN
1 2 NaN 5.0 NaN
2 4 8.0 NaN 1.0
3 3 2.0 NaN 2.0
4 3 2.0 5.0 NaN
From it, I would like to select only columns 0 and 1. Any advice on how to do this efficiently without looping?

Logic
Count the nulls in each column. If the only nulls are at the beginning, then the number of nulls in the column should equal the position of the first valid index.
Get the first valid index of each column.
Slice the index by the null count and compare against the first valid indices. If they are equal, then that's a good column.
cnull = df.isnull().sum()
fvald = df.apply(pd.Series.first_valid_index)
cols = df.index[cnull] == fvald
df.loc[:, cols]
Edited with speed improvements
old answer
def pir1(df):
    cnull = df.isnull().sum()
    fvald = df.apply(pd.Series.first_valid_index)
    cols = df.index[cnull] == fvald
    return df.loc[:, cols]
much faster answer using same logic
def pir2(df):
    nulls = np.isnan(df.values)
    null_count = nulls.sum(0)
    first_valid = nulls.argmin(0)
    null_on_top = null_count == first_valid
    filtered_data = df.values[:, null_on_top]
    filtered_columns = df.columns.values[null_on_top]
    return pd.DataFrame(filtered_data, df.index, filtered_columns)
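As a quick sanity check, the same NumPy logic can be run on the question's example frame. This is only a minimal sketch: the names `keep` and `selected` are mine, and it assumes every column is numeric. An all-NaN column is not selected here, because argmin returns 0 while the null count equals the number of rows.

```python
import numpy as np
import pandas as pd

# The example frame from the question.
df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])

nulls = np.isnan(df.to_numpy(dtype=float))  # True where a value is missing
null_count = nulls.sum(axis=0)              # number of NaNs per column
first_valid = nulls.argmin(axis=0)          # position of the first non-NaN per column
keep = null_count == first_valid            # True when all NaNs sit on top

selected = df.loc[:, keep]
print(list(selected.columns))
```

On this frame only columns 0 and 1 satisfy the condition, which matches what the question asks for.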

Consider a DF as shown which has Nans in various possible locations:
1. Both sides Nans present:
Create a mask by replacing all nans with 0's and finite values with 1's:
mask = np.where(np.isnan(df), 0, 1)
Take the element-wise difference down each column, then take the absolute value. The logic is that a column whose differences contain all three values (-1, 1 and 0) has a break in its sequence and must be discarded.
The idea is to sum the absolute differences and keep the columns where the sum is less than 2. A sum of 0 or 1 means at most one switch between NaNs and finite values; a sum of 2 or more means the NaN and finite runs are interleaved, so that column must be discarded.
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).abs().sum().lt(2)
Finally, use this condition to select the columns, which leaves only those having NaNs in one portion and finite values in the other.
df.loc[:, criteria]
2. Nans present on top:
mask = np.where(np.isnan(df), 0, 1)
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).ne(-1).all()
df.loc[:, criteria]
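The "NaNs on top" check can be sketched on the question's frame as follows. It assumes a column qualifies when its validity mask never steps from 1 (valid) down to 0 (NaN), so the test has to hold for every row of the column; `valid`, `steps` and `no_drop` are names chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])

valid = df.notna().astype(int)   # 1 for finite values, 0 for NaN
steps = valid.diff()             # -1 marks a switch from valid to NaN
no_drop = steps.ne(-1).all()     # True only if the column never switches back to NaN

result = df.loc[:, no_drop]
print(list(result.columns))
```

The first row of `steps` is NaN, which compares as "not equal to -1", so it does not affect the result.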

Related

Why Pandas index will keep same when I explode it and How does iloc work in the same index? [duplicate]

Can someone explain how these two methods of slicing are different?
I've seen the docs,
and I've seen these answers, but I still find myself unable to understand how the two are different. To me, they seem largely interchangeable, because they operate at the lower levels of slicing.
For example, say we want to get the first five rows of a DataFrame. How is it that these two work?
df.loc[:5]
df.iloc[:5]
Can someone present cases where the distinction in use is clearer?
Once upon a time, I also wanted to know how these two functions differ from df.ix[:5] but ix has been removed from pandas 1.0, so I don't care anymore.
Label vs. Location
The main distinction between the two methods is:
loc gets rows (and/or columns) with particular labels.
iloc gets rows (and/or columns) at integer locations.
To demonstrate, consider a series s of characters with a non-monotonic integer index:
>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
49 a
48 b
47 c
0 d
1 e
2 f
>>> s.loc[0] # value at index label 0
'd'
>>> s.iloc[0] # value at index location 0
'a'
>>> s.loc[0:1] # rows at index labels between 0 and 1 (inclusive)
0 d
1 e
>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49 a
Here are some of the differences/similarities between s.loc and s.iloc when passed various objects:
| <object> | description | s.loc[<object>] | s.iloc[<object>] |
|---|---|---|---|
| 0 | single item | Value at index label 0 (the string 'd') | Value at index location 0 (the string 'a') |
| 0:1 | slice | Two rows (labels 0 and 1) | One row (first row at location 0) |
| 1:47 | slice with out-of-bounds end | Zero rows (empty Series) | Five rows (location 1 onwards) |
| 1:47:-1 | slice with negative step | Three rows (labels 1 back to 47) | Zero rows (empty Series) |
| [2, 0] | integer list | Two rows with given labels | Two rows with given locations |
| s > 'e' | Bool series (indicating which values have the property) | One row (containing 'f') | NotImplementedError |
| (s>'e').values | Bool array | One row (containing 'f') | Same as loc |
| 999 | int object not in index | KeyError | IndexError (out of bounds) |
| -1 | int object not in index | KeyError | Returns last value in s |
| lambda x: x.index[3] | callable applied to series (here returning 3rd item in index) | s.loc[s.index[3]] | s.iloc[s.index[3]] |
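A few of these cases can be checked directly. The sketch below rebuilds the series s from above; the helper `error_name` is a hypothetical convenience for this example, not a pandas function.

```python
import pandas as pd

s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])

print(s.loc[0], s.iloc[0])                  # label 0 vs. position 0
print(len(s.loc[0:1]), len(s.iloc[0:1]))    # label slice is inclusive, position slice is not
print(s.loc[[2, 0]].tolist(), s.iloc[[2, 0]].tolist())
print(s.iloc[-1])                           # negative positions count from the end


def error_name(lookup):
    """Return the name of the KeyError/IndexError raised by lookup(), or None."""
    try:
        lookup()
    except (KeyError, IndexError) as exc:
        return type(exc).__name__
    return None


print(error_name(lambda: s.loc[999]))       # label 999 is not in the index
print(error_name(lambda: s.iloc[999]))      # position 999 is out of bounds
```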
loc's label-querying capabilities extend well beyond integer indexes, and it's worth highlighting a couple of additional examples.
Here's a Series where the index contains string objects:
>>> s2 = pd.Series(s.index, index=s.values)
>>> s2
a 49
b 48
c 47
d 0
e 1
f 2
Since loc is label-based, it can fetch the first value in the Series using s2.loc['a']. It can also slice with non-integer objects:
>>> s2.loc['c':'e'] # all rows lying between 'c' and 'e' (inclusive)
c 47
d 0
e 1
For DateTime indexes, we don't need to pass the exact date/time to fetch by label. For example:
>>> s3 = pd.Series(list('abcde'), pd.date_range('now', periods=5, freq='M'))
>>> s3
2021-01-31 16:41:31.879768 a
2021-02-28 16:41:31.879768 b
2021-03-31 16:41:31.879768 c
2021-04-30 16:41:31.879768 d
2021-05-31 16:41:31.879768 e
Then to fetch the row(s) for March/April 2021 we only need:
>>> s3.loc['2021-03':'2021-04']
2021-03-31 16:41:31.879768 c
2021-04-30 16:41:31.879768 d
Rows and Columns
loc and iloc work the same way with DataFrames as they do with Series. It's useful to note that both methods can address columns and rows together.
When given a tuple, the first element is used to index the rows and, if it exists, the second element is used to index the columns.
Consider the DataFrame defined below:
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
...                   index=list('abcde'),
...                   columns=['x','y','z', 8, 9])
>>> df
x y z 8 9
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
e 20 21 22 23 24
Then for example:
>>> df.loc['c': , :'z'] # rows 'c' and onwards AND columns up to 'z'
x y z
c 10 11 12
d 15 16 17
e 20 21 22
>>> df.iloc[:, 3] # all rows, but only the column at index location 3
a 3
b 8
c 13
d 18
e 23
Sometimes we want to mix label and positional indexing methods for the rows and columns, somehow combining the capabilities of loc and iloc.
For example, consider the following DataFrame. How best to slice the rows up to and including 'c' and take the first four columns?
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
...                   index=list('abcde'),
...                   columns=['x','y','z', 8, 9])
>>> df
x y z 8 9
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
e 20 21 22 23 24
We can achieve this result using iloc and the help of another method:
>>> df.iloc[:df.index.get_loc('c') + 1, :4]
x y z 8
a 0 1 2 3
b 5 6 7 8
c 10 11 12 13
get_loc() is an index method meaning "get the position of the label in this index". Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row 'c' as well.
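The conversion can also go the other way: turn the column positions into labels and stay with loc. A minimal sketch on the same frame (the names `by_position` and `by_label` are just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, 5),
                  index=list('abcde'),
                  columns=['x', 'y', 'z', 8, 9])

# Positions everywhere: translate the row label 'c' into its position.
by_position = df.iloc[:df.index.get_loc('c') + 1, :4]

# Labels everywhere: translate the first four column positions into labels.
by_label = df.loc[:'c', df.columns[:4]]

print(by_label.equals(by_position))
```

Here the label slice :'c' already includes row 'c', so no + 1 is needed on that side.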
iloc works based on integer positioning. So no matter what your row labels are, you can always, e.g., get the first row by doing
df.iloc[0]
or the last five rows by doing
df.iloc[-5:]
You can also use it on the columns. This retrieves the 3rd column:
df.iloc[:, 2] # the : in the first position indicates all rows
You can combine them to get intersections of rows and columns:
df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)
On the other hand, .loc uses named indices. Let's set up a data frame with strings as row and column labels:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Then we can get the first row by
df.loc['a'] # equivalent to df.iloc[0]
and the second two rows of the 'date' column by
df.loc['b':, 'date'] # equivalent to df.iloc[1:, 1]
and so on. Now, it's probably worth pointing out that the default row and column indices for a DataFrame are integers from 0, and in this case iloc and loc look up the same row for a single integer. Slices still differ, though: df.loc[:5] includes the row labelled 5 (six rows), while df.iloc[:5] stops before position 5 (five rows). If you had a non-numeric index such as strings or datetimes, df.loc[:5] would raise an error.
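With a default integer index the two slices from the question are close but not identical, because a label slice includes its endpoint and a position slice does not. A minimal sketch, assuming a ten-row frame with a made-up 'value' column:

```python
import pandas as pd

df = pd.DataFrame({'value': range(10)})  # default index: labels 0..9

by_label = df.loc[:5]      # labels 0 through 5, endpoint included
by_position = df.iloc[:5]  # positions 0 through 4, endpoint excluded

print(len(by_label), len(by_position))
```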
Also, you can do column retrieval just by using the data frame's __getitem__:
df['time'] # equivalent to df.loc[:, 'time']
Now suppose you want to mix position and named indexing, that is, indexing using positions on rows and names on columns (to clarify, I mean select from our data frame, rather than creating a data frame with strings in the row index and integers in the column index). This is where .ix came in (.ix has since been removed, as of pandas 1.0, so the line below only runs on older versions):
df.ix[:2, 'time'] # the first two rows of the 'time' column
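Since .ix is no longer available in current pandas, the same selection can be written by converting one side so that both are labels or both are positions. A sketch under the assumption of a small frame with made-up values (the answer's frame holds only missing values):

```python
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3], 'date': [4, 5, 6], 'name': [7, 8, 9]},
                  index=['a', 'b', 'c'])

# Labels on both axes: look up the labels of the first two rows.
via_loc = df.loc[df.index[:2], 'time']

# Positions on both axes: look up the position of the 'time' column.
via_iloc = df.iloc[:2, df.columns.get_loc('time')]

print(via_loc.equals(via_iloc))
```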
I think it's also worth mentioning that you can pass boolean vectors to the loc method as well. For example:
b = [True, False, True]
df.loc[b]
Will return the 1st and 3rd rows of df. This is equivalent to df[b] for selection, but it can also be used for assigning via boolean vectors:
df.loc[b, 'name'] = 'Mary', 'John'
In my opinion, the accepted answer is confusing, since it uses a DataFrame with only missing values. I also do not like the term position-based for .iloc and instead, prefer integer location as it is much more descriptive and exactly what .iloc stands for. The key word is INTEGER - .iloc needs INTEGERS.
See my extremely detailed blog series on subset selection for more
.ix is deprecated and ambiguous and should never be used
Because .ix is deprecated we will only focus on the differences between .loc and .iloc.
Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each index. Let's take a look at a sample DataFrame:
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
The column names and the index values are the labels. The labels age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina and Cornelia, are used for the index.
The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns, but it is easier to just focus on rows for now. Also, each of the indexers uses a set of brackets that immediately follows its name to make its selections.
.loc selects data only by labels
We will first talk about the .loc indexer which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead, default to just the integers from 0 to n-1, where n is the length of the DataFrame.
There are three different inputs you can use for .loc
A string
A list of strings
Slice notation using strings as the start and stop values
Selecting a single row with .loc with a string
To select a single row of data, place the index label inside of the brackets following .loc.
df.loc['Penelope']
This returns the row of data as a Series
age 4
color white
food Apple
height 80
score 3.3
state AL
Name: Penelope, dtype: object
Selecting multiple rows with .loc with a list of strings
df.loc[['Cornelia', 'Jane', 'Dean']]
This returns a DataFrame with the rows in the order specified in the list:
Selecting multiple rows with .loc with slice notation
Slice notation is defined by start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined, so it defaults to 1.
df.loc['Aaron':'Dean']
Complex slices can be taken in the same manner as Python lists.
.iloc selects data only by integer location
Let's now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.
There are three different inputs you can use for .iloc
An integer
A list of integers
Slice notation using integers as the start and stop values
Selecting a single row with .iloc with an integer
df.iloc[4]
This returns the 5th row (integer location 4) as a Series
age 32
color gray
food Cheese
height 180
score 1.8
state AK
Name: Dean, dtype: object
Selecting multiple rows with .iloc with a list of integers
df.iloc[[2, -2]]
This returns a DataFrame of the third and second to last rows:
Selecting multiple rows with .iloc with slice notation
df.iloc[:5:3]
Simultaneous selection of rows and columns with .loc and .iloc
One excellent ability of both .loc/.iloc is their ability to select both rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.
For example, we can select rows Jane, and Dean with just the columns height, score and state like this:
df.loc[['Jane', 'Dean'], 'height':]
This uses a list of labels for the rows and slice notation for the columns
We can naturally do similar operations with .iloc using only integers.
df.iloc[[1,4], 2]
Nick Lamb
Dean Cheese
Name: food, dtype: object
Simultaneous selection with labels and integer location
.ix was used to make selections simultaneously with labels and integer location which was useful but confusing and ambiguous at times and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both your selections labels or integer locations.
For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:
col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names]
Or alternatively, convert the index labels to integers with the get_loc index method.
labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]
Boolean Selection
The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and returning just the food and score columns, we can do the following:
df.loc[df['age'] > 30, ['food', 'score']]
You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:
df.iloc[(df['age'] > 30).values, [2, 4]]
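To confirm that the two boolean selections pick the same cells, here is a short sketch on the sample frame. It uses to_numpy() where the answer uses .values; `by_label` and `by_position` are names for this example only.

```python
import pandas as pd

df = pd.DataFrame({'age': [30, 2, 12, 4, 32, 33, 69],
                   'color': ['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
                   'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
                   'height': [165, 70, 120, 80, 180, 172, 150],
                   'score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])

mask = df['age'] > 30                            # boolean Series aligned to df.index

by_label = df.loc[mask, ['food', 'score']]       # loc accepts the Series directly
by_position = df.iloc[mask.to_numpy(), [2, 4]]   # iloc needs a plain boolean array

print(by_label.equals(by_position))
```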
Selecting all rows
It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:
df.loc[:, 'color':'score':2]
The indexing operator, [], can select rows and columns too but not simultaneously.
Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.
df['food']
Jane Steak
Nick Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
Using a list selects multiple columns
df[['food', 'score']]
What people are less familiar with, is that, when slice notation is used, then selection happens by row labels or by integer location. This is very confusing and something that I almost never use but it does work.
df['Penelope':'Christina'] # slice rows by label
df[2:6:2] # slice rows by integer location
The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.
df[3:5, 'color']
TypeError: unhashable type: 'slice'
.loc and .iloc are used for indexing, i.e., to pull out portions of data. In essence, the difference is that .loc allows label-based indexing, while .iloc allows position-based indexing.
If you get confused by .loc and .iloc, keep in mind that .iloc is based on the index (starting with i) position, while .loc is based on the label (starting with l).
.loc
.loc is supposed to be based on the index labels and not the positions, so it is analogous to Python dictionary-based indexing. However, it can accept boolean arrays, slices, and a list of labels (none of which work with a Python dictionary).
iloc
.iloc does the lookup based on index position, i.e., pandas behaves similarly to a Python list. pandas will raise an IndexError if there is no index at that location.
Examples
The following examples are presented to illustrate the differences between .iloc and .loc. Let's consider the following series:
>>> s = pd.Series([11, 9], index=["1990", "1993"], name="Magic Numbers")
>>> s
1990 11
1993 9
Name: Magic Numbers, dtype: int64
.iloc Examples
>>> s.iloc[0]
11
>>> s.iloc[-1]
9
>>> s.iloc[4]
Traceback (most recent call last):
...
IndexError: single positional indexer is out-of-bounds
>>> s.iloc[0:3] # slice
1990 11
1993 9
Name: Magic Numbers, dtype: int64
>>> s.iloc[[0,1]] # list
1990 11
1993 9
Name: Magic Numbers, dtype: int64
.loc Examples
>>> s.loc['1990']
11
>>> s.loc['1970']
Traceback (most recent call last):
...
KeyError: 'the label [1970] is not in the [index]'
>>> mask = s > 9
>>> s.loc[mask]
1990 11
Name: Magic Numbers, dtype: int64
>>> s.loc['1990':] # slice
1990 11
1993 9
Name: Magic Numbers, dtype: int64
Because s has string index values, .loc will fail when
indexing with an integer:
>>> s.loc[0]
Traceback (most recent call last):
...
KeyError: 0
This example will illustrate the difference:
df = pd.DataFrame({'col1': [1,2,3,4,5], 'col2': ["foo", "bar", "baz", "foobar", "foobaz"]})
col1 col2
0 1 foo
1 2 bar
2 3 baz
3 4 foobar
4 5 foobaz
df = df.sort_values('col1', ascending = False)
col1 col2
4 5 foobaz
3 4 foobar
2 3 baz
1 2 bar
0 1 foo
Position based access:
df.iloc[0, 0:2]
col1 5
col2 foobaz
Name: 4, dtype: object
We get the first row of the sorted dataframe. (This is not the row with index 0, but with index 4).
Label based access:
df.loc[0, 'col1':'col2']
col1 1
col2 foo
Name: 0, dtype: object
We get the row with index 0, even when the df is sorted.
DataFrame.loc: select rows by index label
DataFrame.iloc: select rows by row number (integer position)
Example:
Select the first 5 rows of a table, where df1 is your dataframe
df1.iloc[:5]
Select the rows labelled 'A' and 'B' of a table, where df1 is your dataframe
df1.loc[['A', 'B']]

How to add series to a Dataframe? [duplicate]

I have the following indexed DataFrame with named columns and non-continuous row numbers:
a b c d
2 0.671399 0.101208 -0.181532 0.241273
3 0.446172 -0.243316 0.051767 1.577318
5 0.614758 0.075793 -0.451460 -0.012493
I would like to add a new column, 'e', to the existing data frame without changing anything else in it (the new column always has the same length as the DataFrame).
0 -0.335485
1 -1.166658
2 -0.385571
dtype: float64
How can I add column e to the above example?
Edit 2017
As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame may be assign:
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
Edit 2015
Some reported getting the SettingWithCopyWarning with this code.
However, the code still runs perfectly with the current pandas version 0.16.1.
>>> sLength = len(df1['a'])
>>> df1
a b c d
6 -0.269221 -0.026476 0.997517 1.294385
8 0.917438 0.847941 0.034235 -0.448948
>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e
6 -0.269221 -0.026476 0.997517 1.294385 1.757167
8 0.917438 0.847941 0.034235 -0.448948 2.228131
>>> pd.version.short_version
'0.16.1'
The SettingWithCopyWarning aims to inform you of a possibly invalid assignment on a copy of the DataFrame. It doesn't necessarily say you did it wrong (it can trigger false positives), but from 0.13.0 it lets you know there are more adequate methods for the same purpose. Then, if you get the warning, just follow its advice: Try using .loc[row_index,col_indexer] = value instead
>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
a b c d e f
6 -0.269221 -0.026476 0.997517 1.294385 1.757167 -0.050927
8 0.917438 0.847941 0.034235 -0.448948 2.228131 0.006109
>>>
In fact, this is currently the more efficient method as described in pandas docs
Original answer:
Use the original df1 indexes to create the series:
df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
This is the simple way of adding a new column: df['e'] = e
I would like to add a new column, 'e', to the existing data frame and do not change anything in the data frame. (The series always got the same length as a dataframe.)
I assume that the index values in e match those in df1.
The easiest way to initiate a new column named e, and assign it the values from your series e:
df['e'] = e.values
assign (Pandas 0.16.0+)
As of Pandas 0.16.0, you can also use assign, which assigns new columns to a DataFrame and returns a new object (a copy) with all the original columns in addition to the new ones.
df1 = df1.assign(e=e.values)
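The reason for passing e.values rather than e itself shows up on a frame shaped like the one in the question, whose row labels are 2, 3 and 5 while the series has the default labels 0, 1 and 2. A minimal sketch, using two of the question's columns; `positional` and `aligned` are names for this example:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0.671399, 0.446172, 0.614758],
                    'b': [0.101208, -0.243316, 0.075793]},
                   index=[2, 3, 5])
e = pd.Series([-0.335485, -1.166658, -0.385571])  # default index 0, 1, 2

positional = df1.assign(e=e.values)  # values placed top to bottom, labels ignored
aligned = df1.assign(e=e)            # values matched on labels; only label 2 is shared

print(positional)
print(aligned)
print('e' in df1.columns)            # assign returned new frames; df1 is untouched
```

In the aligned result only the row labelled 2 receives a value (the series entry labelled 2); the other rows are NaN.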
As per this example (which also includes the source code of the assign function), you can also include more than one column:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.assign(mean_a=df.a.mean(), mean_b=df.b.mean())
a b mean_a mean_b
0 1 3 1.5 3.5
1 2 4 1.5 3.5
In context with your example:
np.random.seed(0)
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])
mask = df1 < -0.7
df1 = df1[~mask.any(axis=1)]
sLength = len(df1['a'])
e = pd.Series(np.random.randn(sLength))
>>> df1
a b c d
0 1.764052 0.400157 0.978738 2.240893
2 -0.103219 0.410599 0.144044 1.454274
3 0.761038 0.121675 0.443863 0.333674
7 1.532779 1.469359 0.154947 0.378163
9 1.230291 1.202380 -0.387327 -0.302303
>>> e
0 -1.048553
1 -1.420018
2 -1.706270
3 1.950775
4 -0.509652
dtype: float64
df1 = df1.assign(e=e.values)
>>> df1
a b c d e
0 1.764052 0.400157 0.978738 2.240893 -1.048553
2 -0.103219 0.410599 0.144044 1.454274 -1.420018
3 0.761038 0.121675 0.443863 0.333674 -1.706270
7 1.532779 1.469359 0.154947 0.378163 1.950775
9 1.230291 1.202380 -0.387327 -0.302303 -0.509652
The description of this new feature when it was first introduced can be found here.
Super simple column assignment
A pandas dataframe is implemented as an ordered dict of columns.
This means that the __getitem__ [] can not only be used to get a certain column, but __setitem__ [] = can be used to assign a new column.
For example, this dataframe can have a column added to it by simply using the [] accessor
size name color
0 big rose red
1 small violet blue
2 small tulip red
3 small harebell blue
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
Note that this works even if the index of the dataframe is off.
df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
[]= is the way to go, but watch out!
However, if you have a pd.Series and try to assign it to a dataframe where the indexes are off, you will run into trouble. See example:
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
This is because a pd.Series by default has an index enumerated from 0 to n-1, and the pandas [] = method tries to be "smart".
What is actually going on
When you use the [] = method, pandas quietly aligns the right-hand series to the index of the left-hand dataframe, in effect a left join on the index: df['column'] = series
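The alignment can be seen on a cut-down version of the flower frame. This is a sketch; the column 'other' and the label 99 are made up to show what happens to labels that do not match.

```python
import pandas as pd

df = pd.DataFrame({'name': ['rose', 'violet', 'tulip', 'harebell']},
                  index=[3, 2, 1, 0])

# The series has labels 0..3, so its values are matched to df's rows by label,
# not written top to bottom.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
print(df['protected'].tolist())

# Labels missing from the series become NaN; labels missing from df are dropped.
df['other'] = pd.Series(['x', 'y'], index=[0, 99])
print(df)
```

The frame keeps its four rows: the row labelled 0 gets 'x', the others get NaN, and the value labelled 99 is silently discarded.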
Side note
This quickly causes cognitive dissonance, since the []= method is trying to do a lot of different things depending on the input, and the outcome cannot be predicted unless you just know how pandas works. I would therefore advise against the []= in code bases, but when exploring data in a notebook, it is fine.
Going around the problem
If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and you are not sure of the index order, it is worth safeguarding against this kind of issue.
You could downcast the pd.Series to a np.ndarray or a list, this will do the trick.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values
or
df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))
But this is not very explicit.
Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".
Explicit way
Setting the index of the pd.Series to be the index of the df is explicit.
df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)
Or more realistically, you probably have a pd.Series already available.
protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index
3 no
2 no
1 no
0 yes
Can now be assigned
df['protected'] = protected_series
size name color protected
3 big rose red no
2 small violet blue no
1 small tulip red no
0 small harebell blue yes
Alternative way with df.reset_index()
Since the index mismatch is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series
size name color protected
0 big rose red no
1 small violet blue no
2 small tulip red no
3 small harebell blue yes
Note on df.assign
While df.assign makes it more explicit what you are doing, it actually has all the same problems as the above []=
df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
size name color protected
3 big rose red yes
2 small violet blue no
1 small tulip red no
0 small harebell blue no
Just watch out with df.assign that your column is not called self. It will cause errors. This makes df.assign smelly, since there are these kinds of artifacts in the function.
df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'
You may say, "Well, I'll just not use self then". But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.
It seems that in recent Pandas versions the way to go is to use df.assign:
df1 = df1.assign(e=np.random.randn(sLength))
It doesn't produce SettingWithCopyWarning.
Doing this directly via NumPy will be the most efficient:
df1['e'] = np.random.randn(sLength)
Note my original (very old) suggestion was to use map (which is much slower):
df1['e'] = df1['a'].map(lambda x: np.random.random())
Easiest ways:-
data['new_col'] = list_of_values
data.loc[ : , 'new_col'] = list_of_values
This way you avoid what is called chained indexing when setting new values in a pandas object.
If you want to set the whole new column to an initial base value (e.g. None), you can do this: df1['e'] = None
This actually would assign "object" type to the cell. So later you're free to put complex data types, like list, into individual cells.
I got the dreaded SettingWithCopyWarning, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:
df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength), index=df.index))
This worked fine to insert the column at the end. I don't know if it is the most efficient, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index.
Note that this only works once and will give an error message if you try to overwrite an existing column.
Note: as above, from 0.16.0 onwards assign is the best solution. See the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign
Works well for data flow type where you don't overwrite your intermediate values.
First create a Python list, list_of_e, that has the relevant data.
Use this:
df['e'] = list_of_e
To create an empty column
df['i'] = None
If the column you are trying to add is a series variable then just :
df["new_columns_name"]=series_variable_name #this will do it for you
This works well even if you are replacing an existing column. Just give new_columns_name the same name as the column you want to replace, and the existing column data is overwritten with the new series data.
If the data frame and Series object have the same index, pandas.concat also works here:
import pandas as pd
df
# a b c d
#0 0.671399 0.101208 -0.181532 0.241273
#1 0.446172 -0.243316 0.051767 1.577318
#2 0.614758 0.075793 -0.451460 -0.012493
e = pd.Series([-0.335485, -1.166658, -0.385571])
e
#0 -0.335485
#1 -1.166658
#2 -0.385571
#dtype: float64
# here we need to give the series object a name which converts to the new column name
# in the result
df = pd.concat([df, e.rename("e")], axis=1)
df
# a b c d e
#0 0.671399 0.101208 -0.181532 0.241273 -0.335485
#1 0.446172 -0.243316 0.051767 1.577318 -1.166658
#2 0.614758 0.075793 -0.451460 -0.012493 -0.385571
In case they don't have the same index:
e.index = df.index
df = pd.concat([df, e.rename("e")], axis=1)
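To see why the index has to be set first, here is a small sketch of what concat does when the labels differ. It assumes a frame with the question's row labels 2, 3 and 5; `misaligned` and `aligned` are names for this example.

```python
import pandas as pd

df = pd.DataFrame({'a': [0.671399, 0.446172, 0.614758],
                   'b': [0.101208, -0.243316, 0.075793]},
                  index=[2, 3, 5])
e = pd.Series([-0.335485, -1.166658, -0.385571], name='e')  # labels 0, 1, 2

# Different labels: concat keeps the union of both indexes and pads with NaN.
misaligned = pd.concat([df, e], axis=1)
print(misaligned.shape)

# Same labels: copy e and give it df's index before concatenating.
e_matched = e.copy()
e_matched.index = df.index
aligned = pd.concat([df, e_matched], axis=1)
print(aligned)
```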
Foolproof:
df.loc[:, 'NewCol'] = 'New_Val'
Example:
df = pd.DataFrame(data=np.random.randn(20, 4), columns=['A', 'B', 'C', 'D'])
df
A B C D
0 -0.761269 0.477348 1.170614 0.752714
1 1.217250 -0.930860 -0.769324 -0.408642
2 -0.619679 -1.227659 -0.259135 1.700294
3 -0.147354 0.778707 0.479145 2.284143
4 -0.529529 0.000571 0.913779 1.395894
5 2.592400 0.637253 1.441096 -0.631468
6 0.757178 0.240012 -0.553820 1.177202
7 -0.986128 -1.313843 0.788589 -0.707836
8 0.606985 -2.232903 -1.358107 -2.855494
9 -0.692013 0.671866 1.179466 -1.180351
10 -1.093707 -0.530600 0.182926 -1.296494
11 -0.143273 -0.503199 -1.328728 0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832 0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15 0.955298 -1.430019 1.434071 -0.088215
16 -0.227946 0.047462 0.373573 -0.111675
17 1.627912 0.043611 1.743403 -0.012714
18 0.693458 0.144327 0.329500 -0.655045
19 0.104425 0.037412 0.450598 -0.923387
df.drop([3, 5, 8, 10, 18], inplace=True)
df
A B C D
0 -0.761269 0.477348 1.170614 0.752714
1 1.217250 -0.930860 -0.769324 -0.408642
2 -0.619679 -1.227659 -0.259135 1.700294
4 -0.529529 0.000571 0.913779 1.395894
6 0.757178 0.240012 -0.553820 1.177202
7 -0.986128 -1.313843 0.788589 -0.707836
9 -0.692013 0.671866 1.179466 -1.180351
11 -0.143273 -0.503199 -1.328728 0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832 0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15 0.955298 -1.430019 1.434071 -0.088215
16 -0.227946 0.047462 0.373573 -0.111675
17 1.627912 0.043611 1.743403 -0.012714
19 0.104425 0.037412 0.450598 -0.923387
df.loc[:, 'NewCol'] = 0
df
A B C D NewCol
0 -0.761269 0.477348 1.170614 0.752714 0
1 1.217250 -0.930860 -0.769324 -0.408642 0
2 -0.619679 -1.227659 -0.259135 1.700294 0
4 -0.529529 0.000571 0.913779 1.395894 0
6 0.757178 0.240012 -0.553820 1.177202 0
7 -0.986128 -1.313843 0.788589 -0.707836 0
9 -0.692013 0.671866 1.179466 -1.180351 0
11 -0.143273 -0.503199 -1.328728 0.610552 0
12 -0.923110 -1.365890 -1.366202 -1.185999 0
13 -2.026832 0.273593 -0.440426 -0.627423 0
14 -0.054503 -0.788866 -0.228088 -0.404783 0
15 0.955298 -1.430019 1.434071 -0.088215 0
16 -0.227946 0.047462 0.373573 -0.111675 0
17 1.627912 0.043611 1.743403 -0.012714 0
19 0.104425 0.037412 0.450598 -0.923387 0
One thing to note, though, is that if you do
df1['e'] = Series(np.random.randn(sLength), index=df1.index)
this will effectively be a left join on the df1.index. So if you want to have an outer join effect, my probably imperfect solution is to create a dataframe with index values covering the universe of your data, and then use the code above. For example,
data = pd.DataFrame(index=all_possible_values)
df1['e'] = Series(np.random.randn(sLength), index=df1.index)
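The left-join behaviour described above can be seen on a small frame. This is only a minimal sketch: the frame, the labels and the name outer are made up for illustration, and reindexing to the union of both indexes is just one possible way to get the outer-join effect.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[0, 2, 4])
e = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 2])

# Assignment aligns on df1's index: label 1 of `e` is dropped, label 4 gets NaN.
df1["e"] = e

# For an outer-join effect, first reindex to every label seen in either object.
all_labels = df1.index.union(e.index)
outer = df1[["a"]].reindex(all_labels)
outer["e"] = e
```

Here df1 keeps only its own three labels, while outer has one row per label from either index, with NaN wherever one side has no value.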
To insert a new column at a given location (0 <= loc <= number of columns) in a DataFrame, just use DataFrame.insert:
DataFrame.insert(loc, column, value)
Therefore, if you want to add the column e at the end of a DataFrame called df, you can use:
e = [-0.335485, -1.166658, -0.385571]
df.insert(loc=len(df.columns), column='e', value=e)
value can be a Series, an integer (in which case all cells get filled with this one value), or an array-like structure
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html
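As a small sketch of both uses (the frame and the column name flag are made up for illustration), insert() can append at the end or place a column at the front, and a scalar value is broadcast to every row:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
e = [-0.335485, -1.166658, -0.385571]

# loc == number of columns appends the new column at the end.
df.insert(loc=len(df.columns), column="e", value=e)

# loc == 0 puts the new column first; the scalar 0 fills every row.
df.insert(loc=0, column="flag", value=0)
```

Note that insert() modifies df in place and, by default, raises a ValueError if the column name already exists.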
Let me just add that, just like for hum3, .loc didn't solve the SettingWithCopyWarning and I had to resort to df.insert(). In my case false positive was generated by "fake" chain indexing dict['a']['e'], where 'e' is the new column, and dict['a'] is a DataFrame coming from dictionary.
Also note that if you know what you are doing, you can switch off the warning using
pd.options.mode.chained_assignment = None
and then use one of the other solutions given here.
Before assigning a new column, if you have indexed data, you need to sort the index. At least in my case I had to:
data.set_index(['index_column'], inplace=True)
# if the index is unsorted, assignment of a new column will fail
data.sort_index(inplace=True)
data.loc['index_value1', 'column_y'] = np.random.randn(data.loc['index_value1', 'column_x'].shape[0])
To add a new column, 'e', to the existing data frame
df1.loc[:,'e'] = Series(np.random.randn(sLength))
I was looking for a general way of adding a column of numpy.nans to a dataframe without getting the dumb SettingWithCopyWarning.
From the following:
the answers here
this question about passing a variable as a keyword argument
this method for generating a numpy array of NaNs in-line
I came up with this:
col = 'column_name'
df = df.assign(**{col:numpy.full(len(df), numpy.nan)})
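A minimal sketch of why this avoids the warning, assuming a filtered frame called subset (a made-up name): assign() builds a new DataFrame rather than writing into the existing one, and a scalar NaN is broadcast just like a full array would be.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})
subset = df[df["a"] > 2]  # a filtered frame, where in-place assignment may warn

col = "column_name"  # hypothetical column name held in a variable

# assign() returns a new DataFrame, so nothing is written into `subset` itself.
with_nan = subset.assign(**{col: np.nan})
```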
For the sake of completeness - yet another solution using DataFrame.eval() method:
Data:
In [44]: e
Out[44]:
0 1.225506
1 -1.033944
2 -0.498953
3 -0.373332
4 0.615030
5 -0.622436
dtype: float64
In [45]: df1
Out[45]:
a b c d
0 -0.634222 -0.103264 0.745069 0.801288
4 0.782387 -0.090279 0.757662 -0.602408
5 -0.117456 2.124496 1.057301 0.765466
7 0.767532 0.104304 -0.586850 1.051297
8 -0.103272 0.958334 1.163092 1.182315
9 -0.616254 0.296678 -0.112027 0.679112
Solution:
In [46]: df1.eval("e = @e.values", inplace=True)
In [47]: df1
Out[47]:
a b c d e
0 -0.634222 -0.103264 0.745069 0.801288 1.225506
4 0.782387 -0.090279 0.757662 -0.602408 -1.033944
5 -0.117456 2.124496 1.057301 0.765466 -0.498953
7 0.767532 0.104304 -0.586850 1.051297 -0.373332
8 -0.103272 0.958334 1.163092 1.182315 0.615030
9 -0.616254 0.296678 -0.112027 0.679112 -0.622436
If you just need to create a new empty column then the shortest solution is:
df.loc[:, 'e'] = pd.Series(dtype='float64')  # an explicit dtype avoids the empty-Series dtype warning
The following is what I did... But I'm pretty new to pandas and really Python in general, so no promises.
df = pd.DataFrame([[1, 2], [3, 4], [5,6]], columns=list('AB'))
newCol = [3,5,7]
newName = 'C'
values = np.insert(df.values,df.shape[1],newCol,axis=1)
header = df.columns.values.tolist()
header.append(newName)
df = pd.DataFrame(values,columns=header)
If we want to assign a scalar value, e.g. 10, to all rows of a new column in a df:
df = df.assign(new_col=lambda x: 10) # x is the whole DataFrame passed to the lambda; the scalar 10 is broadcast to every row
df will now have a new column 'new_col' with value=10 in all rows.
If you get the SettingWithCopyWarning, an easy fix is to copy the DataFrame you are trying to add a column to.
df = df.copy()
df['col_name'] = values
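As a sketch of where the copy matters (the frame and values below are made up), the warning typically shows up when the frame being modified was itself obtained by filtering another frame. Taking an explicit copy makes it clear that the original is not supposed to change:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Copy the filtered rows so that `subset` is an independent DataFrame.
subset = df[df["a"] > 2].copy()
subset["col_name"] = [0.1, 0.2]
```

The new column exists only on subset; df is left untouched.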
x=pd.DataFrame([1,2,3,4,5])
y=pd.DataFrame([5,4,3,2,1])
z=pd.concat([x,y],axis=1)
4 ways you can insert a new column to a pandas DataFrame
using simple assignment, insert(), assign() and concat() methods.
import pandas as pd
df = pd.DataFrame({
'col_a':[True, False, False],
'col_b': [1, 2, 3],
})
print(df)
col_a col_b
0 True 1
1 False 2
2 False 3
Using simple assignment
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
print(ser)
0 a
1 b
2 c
dtype: object
df['col_c'] = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
print(df)
col_a col_b col_c
0 True 1 NaN
1 False 2 a
2 False 3 b
Using assign()
e = pd.Series([1.0, 3.0, 2.0], index=[0, 2, 1])
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
df.assign(col_c=ser.values, col_b=e.values)
col_a col_b col_c
0 True 1.0 a
1 False 3.0 b
2 False 2.0 c
Using insert()
df.insert(len(df.columns), 'col_c', ser.values)
print(df)
col_a col_b col_c
0 True 1 a
1 False 2 b
2 False 3 c
Using concat()
ser = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])
df = pd.concat([df, ser.rename('col_c')], axis=1)
print(df)
col_a col_b col_c
0 True 1.0 NaN
1 False 2.0 NaN
2 False 3.0 NaN
10 NaN NaN a
20 NaN NaN b
30 NaN NaN c
This is a special case of adding a new column to a pandas DataFrame: here, I am adding a new feature/column based on the data in an existing column.
Say our DataFrame has the columns 'feature_1', 'feature_2' and 'probability_score', and we have to add a new column 'predicted_class' based on the data in the column 'probability_score'.
I will use the map() method of the Series and define a function of my own that implements the logic for giving a class label to every row of the DataFrame.
data = pd.read_csv('data.csv')
def myFunction(x):
    # implement your logic here
    if so and so:  # replace with your own condition on x
        return a  # replace with the label for rows that meet the condition
    return b  # replace with the label for all other rows
variable_1 = data['probability_score']
predicted_class = variable_1.map(myFunction)
data['predicted_class'] = predicted_class
# check the DataFrame: the new column is included, based on the existing column's data for each row
data.head()
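To make the outline above concrete, here is a runnable sketch. The in-memory frame stands in for data.csv, and the 0.5 threshold, the 0/1 labels and the function name label_from_score are assumptions chosen only for illustration:

```python
import pandas as pd

# Stand-in for pd.read_csv('data.csv'); the values are made up.
data = pd.DataFrame({
    "feature_1": [1.2, 3.4, 5.6],
    "feature_2": [0, 1, 0],
    "probability_score": [0.1, 0.7, 0.5],
})

def label_from_score(score, threshold=0.5):
    """Return class 1 when the score reaches the threshold, else class 0."""
    return 1 if score >= threshold else 0

# map() calls the function once per value in the column.
data["predicted_class"] = data["probability_score"].map(label_from_score)
```

For a simple threshold like this, a vectorized comparison such as (data["probability_score"] >= 0.5).astype(int) gives the same result without calling a Python function per row.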
Whenever you add a Series object as new column to an existing DF, you need to make sure that they both have the same index.
Then add it to the DF
e_series = pd.Series([-0.335485, -1.166658,-0.385571])
print(e_series)
e_series.index = d_f.index
d_f['e'] = e_series
d_f
import pandas as pd
# Define a dictionary containing data
data = {'a': [0,0,0.671399,0.446172,0,0.614758],
'b': [0,0,0.101208,-0.243316,0,0.075793],
'c': [0,0,-0.181532,0.051767,0,-0.451460],
'd': [0,0,0.241273,1.577318,0,-0.012493]}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# Declare a list that is to be converted into a column
col_e = [-0.335485,-1.166658,-0.385571,0,0,0]
# add column 'e'
df['e'] = col_e
# Observe the result
df

How to search values from list and apply np.where condition? [duplicate]

Let’s say I have the following Pandas dataframe:
df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
I can subset based on a specific value:
x = df[df['A'] == 3]
x
A B
2 3 3
But how can I subset based on a list of values? - something like this:
list_of_values = [3,6]
y = df[df['A'] in list_of_values]
To get:
A B
1 6 2
2 3 3
You can use the isin method:
In [1]: df = pd.DataFrame({'A': [5,6,3,4], 'B': [1,2,3,5]})
In [2]: df
Out[2]:
A B
0 5 1
1 6 2
2 3 3
3 4 5
In [3]: df[df['A'].isin([3, 6])]
Out[3]:
A B
1 6 2
2 3 3
And to get the opposite use ~:
In [4]: df[~df['A'].isin([3, 6])]
Out[4]:
A B
0 5 1
3 4 5
You can use the method query:
df.query('A in [6, 3]')
# df.query('A == [6, 3]')
or
lst = [6, 3]
df.query('A in @lst')
# df.query('A == @lst')
Another method:
df.loc[df.apply(lambda x: x.A in [3,6], axis=1)]
Unlike the isin method, this is particularly useful for checking whether the list contains a function of column A. For example, with f(A) = 2*A - 5 as the function:
df.loc[df.apply(lambda x: 2*x.A-5 in [3,6], axis=1)]
It should be noted that this approach is slower than the isin method.
You can store your values in a list as:
lis = [3,6]
then
df1 = df[df['A'].isin(lis)]
list_of_values doesn't have to be a list; it can be set, tuple, dictionary, numpy array, pandas Series, generator, range etc. and isin() and query() will still work.
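A quick sketch of that claim for a few container types, using the frame from the question (the variable names are made up, and only containers I am sure about are shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [5, 6, 3, 4], "B": [1, 2, 3, 5]})

as_set = df[df["A"].isin({3, 6})]
as_tuple = df[df["A"].isin((3, 6))]
as_array = df[df["A"].isin(np.array([3, 6]))]
as_series = df[df["A"].isin(pd.Series([3, 6]))]
as_range = df[df["A"].isin(range(3, 5))]  # range(3, 5) holds 3 and 4
```

The first four all select the rows where A is 6 or 3; the range version selects the rows where A is 3 or 4.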
Some common problems with selecting rows
1. list_of_values is a range
If you need to filter within a range, you can use between() method or query().
list_of_values = [3, 4, 5, 6] # a range of values
df[df['A'].between(3, 6)] # or
df.query('3<=A<=6')
2. Return df in the order of list_of_values
In the OP, the values in list_of_values don't appear in that order in df. If you want df to return in the order they appear in list_of_values, i.e. "sort" by list_of_values, use loc.
list_of_values = [3, 6]
df.set_index('A').loc[list_of_values].reset_index()
If you want to retain the old index, you can use the following.
list_of_values = [3, 6, 3]
df.reset_index().set_index('A').loc[list_of_values].reset_index().set_index('index').rename_axis(None)
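Here is a runnable sketch of both variants on the frame from the question, so the resulting order and index can be inspected. The names ordered and keep_index are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 6, 3, 4], "B": [1, 2, 3, 5]})
list_of_values = [3, 6]

# Rows come back in the order of list_of_values, with a fresh 0..n-1 index.
ordered = df.set_index("A").loc[list_of_values].reset_index()

# Same row order, but the original row labels are kept as the index.
keep_index = (
    df.reset_index()
      .set_index("A")
      .loc[list_of_values]
      .reset_index()
      .set_index("index")
      .rename_axis(None)
)
```

Keep in mind that, unlike isin(), .loc with a list of labels raises a KeyError if any value in the list is missing from column A.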
3. Don't use apply
In general, isin() and query() are the best methods for this task; there's no need for apply(). For example, for function f(A) = 2*A - 5 on column A, both isin() and query() work much more efficiently:
df[(2*df['A']-5).isin(list_of_values)] # or
df[df['A'].mul(2).sub(5).isin(list_of_values)] # or
df.query("A.mul(2).sub(5) in @list_of_values")
4. Select rows not in list_of_values
To select rows not in list_of_values, negate isin()/in:
df[~df['A'].isin(list_of_values)]
df.query("A not in @list_of_values") # df.query("A != @list_of_values")
5. Select rows where multiple columns are in list_of_values
If you want to filter using both (or multiple) columns, there's any() and all() to reduce columns (axis=1) depending on the need.
Select rows where at least one of A or B is in list_of_values:
df[df[['A','B']].isin(list_of_values).any(axis=1)]
df.query("A in @list_of_values or B in @list_of_values")
Select rows where both of A and B are in list_of_values:
df[df[['A','B']].isin(list_of_values).all(axis=1)]
df.query("A in @list_of_values and B in @list_of_values")
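A small sketch on the frame from the question shows what any() and all() do to the intermediate boolean frame. The names in_list, either and both are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 6, 3, 4], "B": [1, 2, 3, 5]})
list_of_values = [3, 6]

# One boolean per cell: is this value in list_of_values?
in_list = df[["A", "B"]].isin(list_of_values)

either = df[in_list.any(axis=1)]  # at least one of A or B matches
both = df[in_list.all(axis=1)]    # A and B both match
```

Rows 1 and 2 have a match in column A, but only row 2 (A=3, B=3) has a match in both columns.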
Bonus:
You can also call isin() inside query():
df.query("A.isin(@list_of_values).values")
It's trickier with f-strings
list_of_values = [3,6]
df.query(f'A in {list_of_values}')
The above answers are correct, but if you still are not able to filter out rows as expected, make sure both DataFrames' columns have the same dtype.
source = source.astype({1: 'int64'})
to_rem = to_rem.astype({'some col': 'int64'})
works = source[~source[1].isin(to_rem['some col'])]
Took me long enough.
A non pandas solution that compares in terms of speed may be:
filtered_column = set(df.A) - set(list_of_values)

conditionally multiply values in DataFrame row

here is an example DataFrame:
df = pd.DataFrame([[1,0.5,-0.3],[0,-4,7],[1,0.12,-.06]], columns=['condition','value1','value2'])
I would like to apply a function which multiplies the values ('value1' and 'value2') in each row by 100 if the value in the 'condition' column of that row is equal to 1; otherwise, the row is left as is.
Presumably some usage of .apply with a lambda function would work here, but I am not able to get the syntax right, e.g.
df.apply(lambda x: 100*x if x['condition'] == 1, axis=1)
will not work
the desired output after applying this operation would be:
As simple as
df.loc[df.condition==1,'value1':]*=100
import numpy as np
df['value1'] = np.where(df['condition']==1, df['value1']*100, df['value1'])
df['value2'] = np.where(df['condition']==1, df['value2']*100, df['value2'])
In case of multiple columns:
# create a list of the columns you want to apply the condition to
columns_list = ['value1','value2']
for i in columns_list:
    df[i] = np.where(df['condition']==1, df[i]*100, df[i])
Use df.loc[] with the condition and filter the list of cols to operate then multiply:
l=['value1','value2'] #list of cols to operate on
df.loc[df.condition.eq(1),l]=df.mul(100)
#if condition is just 0 and 1 -> df.loc[df.condition.astype(bool),l]=df.mul(100)
print(df)
Another solution using df.mask() using same list of cols as above:
df[l]=df[l].mask(df.condition.eq(1),df[l]*100)
print(df)
condition value1 value2
0 1 50.0 -30.0
1 0 -4.0 7.0
2 1 12.0 -6.0
Use a mask as the filter. This is how np.where works: where the mask is true it chooses the second argument, and where it is false it chooses the third.
import numpy as np
value_cols = ['value1','value2']
mask = (df.condition == 1)
df[value_cols] = np.where(mask.to_numpy()[:, None], df[value_cols].mul(100), df[value_cols])
If you have multiple value columns such as value1, value2 ... and so on, use
value_cols = df.filter(regex=r'value\d').columns
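Another way to express the same idea, sketched here as an alternative rather than as what the answers above do, is to turn the condition into a per-row factor (100 or 1) and multiply the value columns by it. The name factor and the anchored regex are my own choices:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 0.5, -0.3], [0, -4, 7], [1, 0.12, -0.06]],
    columns=["condition", "value1", "value2"],
)

# Pick up value1, value2, ... without listing them by hand.
value_cols = df.filter(regex=r"^value\d+$").columns

# 100 where the condition holds, 1 (i.e. leave unchanged) elsewhere.
factor = pd.Series(np.where(df["condition"].eq(1), 100, 1), index=df.index)

# axis=0 lines the factor up with the rows, so each row is scaled as a whole.
df[value_cols] = df[value_cols].mul(factor, axis=0)
```

Rows 0 and 2 are scaled by 100 and row 1 is left as it was, which matches the output shown earlier.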

Correct syntax for accessing a row in Pandas dataframe [duplicate]

Can someone explain how these two methods of slicing are different?
I've seen the docs, and I've seen these answers, but I still find myself unable to understand how the two are different. To me, they seem interchangeable in large part, because they are at the lower levels of slicing.
For example, say we want to get the first five rows of a DataFrame. How is it that these two work?
df.loc[:5]
df.iloc[:5]
Can someone present three cases where the distinction in uses are clearer?
Once upon a time, I also wanted to know how these two functions differ from df.ix[:5] but ix has been removed from pandas 1.0, so I don't care anymore.
Label vs. Location
The main distinction between the two methods is:
loc gets rows (and/or columns) with particular labels.
iloc gets rows (and/or columns) at integer locations.
To demonstrate, consider a series s of characters with a non-monotonic integer index:
>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
49 a
48 b
47 c
0 d
1 e
2 f
>>> s.loc[0] # value at index label 0
'd'
>>> s.iloc[0] # value at index location 0
'a'
>>> s.loc[0:1] # rows at index labels between 0 and 1 (inclusive)
0 d
1 e
>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49 a
Here are some of the differences/similarities between s.loc and s.iloc when passed various objects:
| <object> | description | s.loc[<object>] | s.iloc[<object>] |
|---|---|---|---|
| 0 | single item | Value at index label 0 (the string 'd') | Value at index location 0 (the string 'a') |
| 0:1 | slice | Two rows (labels 0 and 1) | One row (first row at location 0) |
| 1:47 | slice with out-of-bounds end | Zero rows (empty Series) | Five rows (location 1 onwards) |
| 1:47:-1 | slice with negative step | Three rows (labels 1 back to 47) | Zero rows (empty Series) |
| [2, 0] | integer list | Two rows with given labels | Two rows with given locations |
| s > 'e' | Bool series (indicating which values have the property) | One row (containing 'f') | NotImplementedError |
| (s>'e').values | Bool array | One row (containing 'f') | Same as loc |
| 999 | int object not in index | KeyError | IndexError (out of bounds) |
| -1 | int object not in index | KeyError | Returns last value in s |
| lambda x: x.index[3] | callable applied to series (here returning 3rd item in index) | s.loc[s.index[3]] | s.iloc[s.index[3]] |
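Several of the cases listed above can be checked directly. This is a small sketch on the same series s; the helper raises() and the variable names are made up for illustration:

```python
import pandas as pd

s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])

label_item = s.loc[0]          # value stored under label 0
position_item = s.iloc[0]      # value stored at position 0
label_slice = s.loc[0:1]       # label slice, end label included
position_slice = s.iloc[0:1]   # position slice, end position excluded
by_labels = s.loc[[2, 0]]      # rows labelled 2 and 0
by_positions = s.iloc[[2, 0]]  # rows at positions 2 and 0
last_value = s.iloc[-1]        # negative positions count from the end


def raises(exc_type, func):
    """Return True if calling func() raises exc_type."""
    try:
        func()
    except exc_type:
        return True
    return False


missing_label = raises(KeyError, lambda: s.loc[-1])      # -1 is not a label
out_of_bounds = raises(IndexError, lambda: s.iloc[999])  # no such position
```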
loc's label-querying capabilities extend well-beyond integer indexes and it's worth highlighting a couple of additional examples.
Here's a Series where the index contains string objects:
>>> s2 = pd.Series(s.index, index=s.values)
>>> s2
a 49
b 48
c 47
d 0
e 1
f 2
Since loc is label-based, it can fetch the first value in the Series using s2.loc['a']. It can also slice with non-integer objects:
>>> s2.loc['c':'e'] # all rows lying between 'c' and 'e' (inclusive)
c 47
d 0
e 1
For DateTime indexes, we don't need to pass the exact date/time to fetch by label. For example:
>>> s3 = pd.Series(list('abcde'), pd.date_range('now', periods=5, freq='M'))
>>> s3
2021-01-31 16:41:31.879768 a
2021-02-28 16:41:31.879768 b
2021-03-31 16:41:31.879768 c
2021-04-30 16:41:31.879768 d
2021-05-31 16:41:31.879768 e
Then to fetch the row(s) for March/April 2021 we only need:
>>> s3.loc['2021-03':'2021-04']
2021-03-31 17:04:30.742316 c
2021-04-30 17:04:30.742316 d
Rows and Columns
loc and iloc work the same way with DataFrames as they do with Series. It's useful to note that both methods can address columns and rows together.
When given a tuple, the first element is used to index the rows and, if it exists, the second element is used to index the columns.
Consider the DataFrame defined below:
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
index=list('abcde'),
columns=['x','y','z', 8, 9])
>>> df
x y z 8 9
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
e 20 21 22 23 24
Then for example:
>>> df.loc['c': , :'z'] # rows 'c' and onwards AND columns up to 'z'
x y z
c 10 11 12
d 15 16 17
e 20 21 22
>>> df.iloc[:, 3] # all rows, but only the column at index location 3
a 3
b 8
c 13
d 18
e 23
Sometimes we want to mix label and positional indexing methods for the rows and columns, somehow combining the capabilities of loc and iloc.
For example, consider the following DataFrame. How best to slice the rows up to and including 'c' and take the first four columns?
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
index=list('abcde'),
columns=['x','y','z', 8, 9])
>>> df
x y z 8 9
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
e 20 21 22 23 24
We can achieve this result using iloc and the help of another method:
>>> df.iloc[:df.index.get_loc('c') + 1, :4]
x y z 8
a 0 1 2 3
b 5 6 7 8
c 10 11 12 13
get_loc() is an index method meaning "get the position of the label in this index". Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row 'c' as well.
iloc works based on integer positioning. So no matter what your row labels are, you can always, e.g., get the first row by doing
df.iloc[0]
or the last five rows by doing
df.iloc[-5:]
You can also use it on the columns. This retrieves the 3rd column:
df.iloc[:, 2] # the : in the first position indicates all rows
You can combine them to get intersections of rows and columns:
df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)
On the other hand, .loc uses named indices. Let's set up a data frame with strings as row and column labels:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Then we can get the first row by
df.loc['a'] # equivalent to df.iloc[0]
and the last two rows of the 'date' column by
df.loc['b':, 'date'] # equivalent to df.iloc[1:, 1]
and so on. Now, it's probably worth pointing out that the default row and column indices for a DataFrame are integers from 0, and in this case iloc and loc behave almost identically. This is why your examples look equivalent; the one difference is that a loc slice includes its end label, so with the default index df.loc[:5] returns six rows while df.iloc[:5] returns five. If you had a non-numeric index such as strings or datetimes, df.loc[:5] would raise an error.
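One caveat with the default integer index is the slice endpoint, which a minimal sketch makes visible (the ten-row frame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})  # default index 0..9

by_label = df.loc[:5]      # labels 0 through 5, end label included
by_position = df.iloc[:5]  # positions 0 through 4, end position excluded
```

So by_label has six rows and by_position has five, even though the labels and positions coincide here.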
Also, you can do column retrieval just by using the data frame's __getitem__:
df['time'] # equivalent to df.loc[:, 'time']
Now suppose you want to mix position and named indexing, that is, indexing using names on rows and positions on columns (to clarify, I mean select from our data frame, rather than creating a data frame with strings in the row index and integers in the column index). This is where .ix used to come in (it has since been removed, as of pandas 1.0):
df.ix[:2, 'time'] # the first two rows of the 'time' column
I think it's also worth mentioning that you can pass boolean vectors to the loc method as well. For example:
b = [True, False, True]
df.loc[b]
Will return the 1st and 3rd rows of df. This is equivalent to df[b] for selection, but it can also be used for assigning via boolean vectors:
df.loc[b, 'name'] = 'Mary', 'John'
In my opinion, the accepted answer is confusing, since it uses a DataFrame with only missing values. I also do not like the term position-based for .iloc and instead, prefer integer location as it is much more descriptive and exactly what .iloc stands for. The key word is INTEGER - .iloc needs INTEGERS.
See my extremely detailed blog series on subset selection for more
.ix is deprecated and ambiguous and should never be used
Because .ix is deprecated we will only focus on the differences between .loc and .iloc.
Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each index. Let's take a look at a sample DataFrame:
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
The column names and the index values are the labels. The labels age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina and Cornelia, are used for the index.
The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.
.loc selects data only by labels
We will first talk about the .loc indexer which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead, default to just the integers from 0 to n-1, where n is the length of the DataFrame.
There are three different inputs you can use for .loc
A string
A list of strings
Slice notation using strings as the start and stop values
Selecting a single row with .loc with a string
To select a single row of data, place the index label inside of the brackets following .loc.
df.loc['Penelope']
This returns the row of data as a Series
age 4
color white
food Apple
height 80
score 3.3
state AL
Name: Penelope, dtype: object
Selecting multiple rows with .loc with a list of strings
df.loc[['Cornelia', 'Jane', 'Dean']]
This returns a DataFrame with the rows in the order specified in the list:
Selecting multiple rows with .loc with slice notation
Slice notation is defined by start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined, so it defaults to 1.
df.loc['Aaron':'Dean']
Complex slices can be taken in the same manner as Python lists.
.iloc selects data only by integer location
Let's now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.
There are three different inputs you can use for .iloc
An integer
A list of integers
Slice notation using integers as the start and stop values
Selecting a single row with .iloc with an integer
df.iloc[4]
This returns the 5th row (integer location 4) as a Series
age 32
color gray
food Cheese
height 180
score 1.8
state AK
Name: Dean, dtype: object
Selecting multiple rows with .iloc with a list of integers
df.iloc[[2, -2]]
This returns a DataFrame of the third and second to last rows:
Selecting multiple rows with .iloc with slice notation
df.iloc[:5:3]
Simultaneous selection of rows and columns with .loc and .iloc
One excellent ability of both .loc/.iloc is their ability to select both rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.
For example, we can select rows Jane, and Dean with just the columns height, score and state like this:
df.loc[['Jane', 'Dean'], 'height':]
This uses a list of labels for the rows and slice notation for the columns
We can naturally do similar operations with .iloc using only integers.
df.iloc[[1,4], 2]
Nick Lamb
Dean Cheese
Name: food, dtype: object
Simultaneous selection with labels and integer location
.ix was used to make selections simultaneously with labels and integer location which was useful but confusing and ambiguous at times and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both your selections labels or integer locations.
For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:
col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names]
Or alternatively, convert the index labels to integers with the get_loc index method.
labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]
Boolean Selection
The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows where age is above 30 and returning just the food and score columns, we can do the following:
df.loc[df['age'] > 30, ['food', 'score']]
You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:
df.iloc[(df['age'] > 30).values, [2, 4]]
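The two boolean selections above can be compared side by side on the sample DataFrame. This is a sketch; by_label and by_position are made-up names, and to_numpy() is used here in place of .values, which does the same job:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [30, 2, 12, 4, 32, 33, 69],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
    },
    index=["Jane", "Nick", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

older = df["age"] > 30  # boolean Series, one entry per row

# .loc takes the boolean Series as-is, plus column labels.
by_label = df.loc[older, ["food", "score"]]

# .iloc needs a plain boolean array, plus integer column positions.
by_position = df.iloc[older.to_numpy(), [2, 4]]
```

Both selections return the same three rows (Dean, Christina and Cornelia) with the food and score columns.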
Selecting all rows
It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:
df.loc[:, 'color':'score':2]
The indexing operator, [], can select rows and columns too but not simultaneously.
Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.
df['food']
Jane Steak
Nick Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
Using a list selects multiple columns
df[['food', 'score']]
What people are less familiar with, is that, when slice notation is used, then selection happens by row labels or by integer location. This is very confusing and something that I almost never use but it does work.
df['Penelope':'Christina'] # slice rows by label
df[2:6:2] # slice rows by integer location
The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.
df[3:5, 'color']
TypeError: unhashable type: 'slice'
.loc and .iloc are used for indexing, i.e., to pull out portions of data. In essence, the difference is that .loc allows label-based indexing, while .iloc allows position-based indexing.
If you get confused by .loc and .iloc, keep in mind that .iloc is based on the index (starting with i) position, while .loc is based on the label (starting with l).
.loc
.loc is supposed to be based on the index labels and not the positions, so it is analogous to Python dictionary-based indexing. However, it can accept boolean arrays, slices, and a list of labels (none of which work with a Python dictionary).
iloc
.iloc does the lookup based on index position, i.e., pandas behaves similarly to a Python list. pandas will raise an IndexError if there is no index at that location.
Examples
The following examples are presented to illustrate the differences between .iloc and .loc. Let's consider the following series:
>>> s = pd.Series([11, 9], index=["1990", "1993"], name="Magic Numbers")
>>> s
1990 11
1993 9
Name: Magic Numbers, dtype: int64
.iloc Examples
>>> s.iloc[0]
11
>>> s.iloc[-1]
9
>>> s.iloc[4]
Traceback (most recent call last):
...
IndexError: single positional indexer is out-of-bounds
>>> s.iloc[0:3] # slice
1990 11
1993 9
Name: Magic Numbers, dtype: int64
>>> s.iloc[[0,1]] # list
1990 11
1993 9
Name: Magic Numbers, dtype: int64
.loc Examples
>>> s.loc['1990']
11
>>> s.loc['1970']
Traceback (most recent call last):
...
KeyError: 'the label [1970] is not in the [index]'
>>> mask = s > 9
>>> s.loc[mask]
1990 11
Name: Magic Numbers, dtype: int64
>>> s.loc['1990':] # slice
1990 11
1993 9
Name: Magic Numbers, dtype: int64
Because s has string index values, .loc will fail when indexing with an integer:
>>> s.loc[0]
Traceback (most recent call last):
...
KeyError: 0
This example will illustrate the difference:
df = pd.DataFrame({'col1': [1,2,3,4,5], 'col2': ["foo", "bar", "baz", "foobar", "foobaz"]})
col1 col2
0 1 foo
1 2 bar
2 3 baz
3 4 foobar
4 5 foobaz
df = df.sort_values('col1', ascending = False)
col1 col2
4 5 foobaz
3 4 foobar
2 3 baz
1 2 bar
0 1 foo
Position based access:
df.iloc[0, 0:2]
col1 5
col2 foobaz
Name: 4, dtype: object
We get the first row of the sorted dataframe. (This is not the row with index 0, but with index 4).
Label based access:
df.loc[0, 'col1':'col2']
col1 1
col2 foo
Name: 0, dtype: object
We get the row with index 0, even when the df is sorted.
DataFrame.loc[] : select rows by index label
DataFrame.iloc[] : select rows by row number (integer position)
Example:
Select the first 5 rows of a table, where df1 is your dataframe
df1.iloc[:5]
Select the rows labeled A and B of a table, where df1 is your dataframe
df1.loc[['A', 'B']]

Resources