Pandas converts integer numbers to real numbers when reading from Excel - excel

I recently started exploring python for analyzing excel data.
I have an excel file with two worksheets, each one with one matrix (with m = 1000 rows and n= 999 columns).The elements of both matrices are related to each other: one of the matrices concerns diplacement values and the other matrix concerns the force values corresponding to each displacement. The displacements and corresponding forces are obtained from m=1000 numerical simulations and n= 999 increments. Is it possible to identify the force values that correspond only to displacement values that are integer numbers? Or, as an alternative, is it possible to replace all the decimal numbers from the matrix of displacements by 0? I tried to read the excel file into a Pandas dataframe, however all values from the matrix of displacements seem presented as "real numbers" (e.g. numbers "1", "2", "3", etc. from excel are presented with a floating point as "1.", "2.", "3." in python).
Thank you for your attention.

Let's make an example in a smaller scale (3 * 3).
I prepared an Excel file with 2 sheets and read them:
displ = pd.read_excel('Input_2.xlsx', 'Displ')
forces = pd.read_excel('Input_2.xlsx', 'Forces')
Both DataFrames contain:
displ forces
C1 C2 C3 C1 C2 C3
0 10.0 12.1 11.3 0 120.1 130.2 140.3
1 12.5 13.0 13.5 1 150.4 160.5 170.6
2 12.6 13.6 13.8 2 180.7 190.8 200.9
To identify elements of displ containing integer numbers
(actually, still float numbers, but with the fractional
parts == 0.0), you can run:
displ.mod(1.0) == 0.0
and you will get:
C1 C2 C3
0 True False False
1 False True False
2 False False False
And to get corresponding force values and NaN
for other values, you can run:
forces.where(displ.mod(1.0) == 0.0)
getting:
C1 C2 C3
0 120.1 NaN NaN
1 NaN 160.5 NaN
2 NaN NaN NaN
Another option is to get a list of indices in displ where
the corresponding element has zero fractional part.
Actually it is a Numpy function, so it operates on the
underlying Numpy array and returns integer (zero-based)
indices:
ind = np.nonzero((displ.mod(1.0) == 0.0).values)
The result is:
(array([0, 1], dtype=int64), array([0, 1], dtype=int64))
so it is a 2-tuple of indices:
row indices,
column indices.
You can also retrieve a list of indicated elements from
forces, actually also from the underlying Numpy array,
running:
forces.values[ind]
The result is:
array([120.1, 160.5])
To replace "integer" elements of displ with zeroes, you
can run:
displ.mask(displ.mod(1.0) == 0.0, 0, inplace=True)
Now displ contains:
C1 C2 C3
0 0.0 12.1 11.3
1 12.5 0.0 13.5
2 12.6 13.6 13.8
Note that the "wanted" elements are still float zeroes,
but this is a feature of Pandas that each column has one
type, fitting all elements in this column (in this case just float).

Related

Why Pandas index will keep same when I explode it and How does iloc work in the same index? [duplicate]

Can someone explain how these two methods of slicing are different?
I've seen the docs,
and I've seen these answers, but I still find myself unable to understand how the three are different. To me, they seem interchangeable in large part, because they are at the lower levels of slicing.
For example, say we want to get the first five rows of a DataFrame. How is it that these two work?
df.loc[:5]
df.iloc[:5]
Can someone present three cases where the distinction in uses are clearer?
Once upon a time, I also wanted to know how these two functions differ from df.ix[:5] but ix has been removed from pandas 1.0, so I don't care anymore.
Label vs. Location
The main distinction between the two methods is:
loc gets rows (and/or columns) with particular labels.
iloc gets rows (and/or columns) at integer locations.
To demonstrate, consider a series s of characters with a non-monotonic integer index:
>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
49 a
48 b
47 c
0 d
1 e
2 f
>>> s.loc[0] # value at index label 0
'd'
>>> s.iloc[0] # value at index location 0
'a'
>>> s.loc[0:1] # rows at index labels between 0 and 1 (inclusive)
0 d
1 e
>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49 a
Here are some of the differences/similarities between s.loc and s.iloc when passed various objects:
<object>
description
s.loc[<object>]
s.iloc[<object>]
0
single item
Value at index label 0 (the string 'd')
Value at index location 0 (the string 'a')
0:1
slice
Two rows (labels 0 and 1)
One row (first row at location 0)
1:47
slice with out-of-bounds end
Zero rows (empty Series)
Five rows (location 1 onwards)
1:47:-1
slice with negative step
three rows (labels 1 back to 47)
Zero rows (empty Series)
[2, 0]
integer list
Two rows with given labels
Two rows with given locations
s > 'e'
Bool series (indicating which values have the property)
One row (containing 'f')
NotImplementedError
(s>'e').values
Bool array
One row (containing 'f')
Same as loc
999
int object not in index
KeyError
IndexError (out of bounds)
-1
int object not in index
KeyError
Returns last value in s
lambda x: x.index[3]
callable applied to series (here returning 3rd item in index)
s.loc[s.index[3]]
s.iloc[s.index[3]]
loc's label-querying capabilities extend well-beyond integer indexes and it's worth highlighting a couple of additional examples.
Here's a Series where the index contains string objects:
>>> s2 = pd.Series(s.index, index=s.values)
>>> s2
a 49
b 48
c 47
d 0
e 1
f 2
Since loc is label-based, it can fetch the first value in the Series using s2.loc['a']. It can also slice with non-integer objects:
>>> s2.loc['c':'e'] # all rows lying between 'c' and 'e' (inclusive)
c 47
d 0
e 1
For DateTime indexes, we don't need to pass the exact date/time to fetch by label. For example:
>>> s3 = pd.Series(list('abcde'), pd.date_range('now', periods=5, freq='M'))
>>> s3
2021-01-31 16:41:31.879768 a
2021-02-28 16:41:31.879768 b
2021-03-31 16:41:31.879768 c
2021-04-30 16:41:31.879768 d
2021-05-31 16:41:31.879768 e
Then to fetch the row(s) for March/April 2021 we only need:
>>> s3.loc['2021-03':'2021-04']
2021-03-31 17:04:30.742316 c
2021-04-30 17:04:30.742316 d
Rows and Columns
loc and iloc work the same way with DataFrames as they do with Series. It's useful to note that both methods can address columns and rows together.
When given a tuple, the first element is used to index the rows and, if it exists, the second element is used to index the columns.
Consider the DataFrame defined below:
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
index=list('abcde'),
columns=['x','y','z', 8, 9])
>>> df
x y z 8 9
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
e 20 21 22 23 24
Then for example:
>>> df.loc['c': , :'z'] # rows 'c' and onwards AND columns up to 'z'
x y z
c 10 11 12
d 15 16 17
e 20 21 22
>>> df.iloc[:, 3] # all rows, but only the column at index location 3
a 3
b 8
c 13
d 18
e 23
Sometimes we want to mix label and positional indexing methods for the rows and columns, somehow combining the capabilities of loc and iloc.
For example, consider the following DataFrame. How best to slice the rows up to and including 'c' and take the first four columns?
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
index=list('abcde'),
columns=['x','y','z', 8, 9])
>>> df
x y z 8 9
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
e 20 21 22 23 24
We can achieve this result using iloc and the help of another method:
>>> df.iloc[:df.index.get_loc('c') + 1, :4]
x y z 8
a 0 1 2 3
b 5 6 7 8
c 10 11 12 13
get_loc() is an index method meaning "get the position of the label in this index". Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row 'c' as well.
iloc works based on integer positioning. So no matter what your row labels are, you can always, e.g., get the first row by doing
df.iloc[0]
or the last five rows by doing
df.iloc[-5:]
You can also use it on the columns. This retrieves the 3rd column:
df.iloc[:, 2] # the : in the first position indicates all rows
You can combine them to get intersections of rows and columns:
df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)
On the other hand, .loc use named indices. Let's set up a data frame with strings as row and column labels:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Then we can get the first row by
df.loc['a'] # equivalent to df.iloc[0]
and the second two rows of the 'date' column by
df.loc['b':, 'date'] # equivalent to df.iloc[1:, 1]
and so on. Now, it's probably worth pointing out that the default row and column indices for a DataFrame are integers from 0 and in this case iloc and loc would work in the same way. This is why your three examples are equivalent. If you had a non-numeric index such as strings or datetimes, df.loc[:5] would raise an error.
Also, you can do column retrieval just by using the data frame's __getitem__:
df['time'] # equivalent to df.loc[:, 'time']
Now suppose you want to mix position and named indexing, that is, indexing using names on rows and positions on columns (to clarify, I mean select from our data frame, rather than creating a data frame with strings in the row index and integers in the column index). This is where .ix comes in:
df.ix[:2, 'time'] # the first two rows of the 'time' column
I think it's also worth mentioning that you can pass boolean vectors to the loc method as well. For example:
b = [True, False, True]
df.loc[b]
Will return the 1st and 3rd rows of df. This is equivalent to df[b] for selection, but it can also be used for assigning via boolean vectors:
df.loc[b, 'name'] = 'Mary', 'John'
In my opinion, the accepted answer is confusing, since it uses a DataFrame with only missing values. I also do not like the term position-based for .iloc and instead, prefer integer location as it is much more descriptive and exactly what .iloc stands for. The key word is INTEGER - .iloc needs INTEGERS.
See my extremely detailed blog series on subset selection for more
.ix is deprecated and ambiguous and should never be used
Because .ix is deprecated we will only focus on the differences between .loc and .iloc.
Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each index. Let's take a look at a sample DataFrame:
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
All the words in bold are the labels. The labels, age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina, Cornelia are used for the index.
The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.
.loc selects data only by labels
We will first talk about the .loc indexer which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead, default to just the integers from 0 to n-1, where n is the length of the DataFrame.
There are three different inputs you can use for .loc
A string
A list of strings
Slice notation using strings as the start and stop values
Selecting a single row with .loc with a string
To select a single row of data, place the index label inside of the brackets following .loc.
df.loc['Penelope']
This returns the row of data as a Series
age 4
color white
food Apple
height 80
score 3.3
state AL
Name: Penelope, dtype: object
Selecting multiple rows with .loc with a list of strings
df.loc[['Cornelia', 'Jane', 'Dean']]
This returns a DataFrame with the rows in the order specified in the list:
Selecting multiple rows with .loc with slice notation
Slice notation is defined by a start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined but defaulted to 1.
df.loc['Aaron':'Dean']
Complex slices can be taken in the same manner as Python lists.
.iloc selects data only by integer location
Let's now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.
There are three different inputs you can use for .iloc
An integer
A list of integers
Slice notation using integers as the start and stop values
Selecting a single row with .iloc with an integer
df.iloc[4]
This returns the 5th row (integer location 4) as a Series
age 32
color gray
food Cheese
height 180
score 1.8
state AK
Name: Dean, dtype: object
Selecting multiple rows with .iloc with a list of integers
df.iloc[[2, -2]]
This returns a DataFrame of the third and second to last rows:
Selecting multiple rows with .iloc with slice notation
df.iloc[:5:3]
Simultaneous selection of rows and columns with .loc and .iloc
One excellent ability of both .loc/.iloc is their ability to select both rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.
For example, we can select rows Jane, and Dean with just the columns height, score and state like this:
df.loc[['Jane', 'Dean'], 'height':]
This uses a list of labels for the rows and slice notation for the columns
We can naturally do similar operations with .iloc using only integers.
df.iloc[[1,4], 2]
Nick Lamb
Dean Cheese
Name: food, dtype: object
Simultaneous selection with labels and integer location
.ix was used to make selections simultaneously with labels and integer location which was useful but confusing and ambiguous at times and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both your selections labels or integer locations.
For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:
col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names]
Or alternatively, convert the index labels to integers with the get_loc index method.
labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]
Boolean Selection
The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows wher age is above 30 and return just the food and score columns we can do the following:
df.loc[df['age'] > 30, ['food', 'score']]
You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:
df.iloc[(df['age'] > 30).values, [2, 4]]
Selecting all rows
It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:
df.loc[:, 'color':'score':2]
The indexing operator, [], can select rows and columns too but not simultaneously.
Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.
df['food']
Jane Steak
Nick Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
Using a list selects multiple columns
df[['food', 'score']]
What people are less familiar with, is that, when slice notation is used, then selection happens by row labels or by integer location. This is very confusing and something that I almost never use but it does work.
df['Penelope':'Christina'] # slice rows by label
df[2:6:2] # slice rows by integer location
The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.
df[3:5, 'color']
TypeError: unhashable type: 'slice'
.loc and .iloc are used for indexing, i.e., to pull out portions of data. In essence, the difference is that .loc allows label-based indexing, while .iloc allows position-based indexing.
If you get confused by .loc and .iloc, keep in mind that .iloc is based on the index (starting with i) position, while .loc is based on the label (starting with l).
.loc
.loc is supposed to be based on the index labels and not the positions, so it is analogous to Python dictionary-based indexing. However, it can accept boolean arrays, slices, and a list of labels (none of which work with a Python dictionary).
iloc
.iloc does the lookup based on index position, i.e., pandas behaves similarly to a Python list. pandas will raise an IndexError if there is no index at that location.
Examples
The following examples are presented to illustrate the differences between .iloc and .loc. Let's consider the following series:
>>> s = pd.Series([11, 9], index=["1990", "1993"], name="Magic Numbers")
>>> s
1990 11
1993 9
Name: Magic Numbers , dtype: int64
.iloc Examples
>>> s.iloc[0]
11
>>> s.iloc[-1]
9
>>> s.iloc[4]
Traceback (most recent call last):
...
IndexError: single positional indexer is out-of-bounds
>>> s.iloc[0:3] # slice
1990 11
1993 9
Name: Magic Numbers , dtype: int64
>>> s.iloc[[0,1]] # list
1990 11
1993 9
Name: Magic Numbers , dtype: int64
.loc Examples
>>> s.loc['1990']
11
>>> s.loc['1970']
Traceback (most recent call last):
...
KeyError: ’the label [1970] is not in the [index]’
>>> mask = s > 9
>>> s.loc[mask]
1990 11
Name: Magic Numbers , dtype: int64
>>> s.loc['1990':] # slice
1990 11
1993 9
Name: Magic Numbers, dtype: int64
Because s has string index values, .loc will fail when
indexing with an integer:
>>> s.loc[0]
Traceback (most recent call last):
...
KeyError: 0
This example will illustrate the difference:
df = pd.DataFrame({'col1': [1,2,3,4,5], 'col2': ["foo", "bar", "baz", "foobar", "foobaz"]})
col1 col2
0 1 foo
1 2 bar
2 3 baz
3 4 foobar
4 5 foobaz
df = df.sort_values('col1', ascending = False)
col1 col2
4 5 foobaz
3 4 foobar
2 3 baz
1 2 bar
0 1 foo
Index based access:
df.iloc[0, 0:2]
col1 5
col2 foobaz
Name: 4, dtype: object
We get the first row of the sorted dataframe. (This is not the row with index 0, but with index 4).
Position based access:
df.loc[0, 'col1':'col2']
col1 1
col2 foo
Name: 0, dtype: object
We get the row with index 0, even when the df is sorted.
DataFrame.loc() : Select rows by index value
DataFrame.iloc() : Select rows by rows number
Example:
Select first 5 rows of a table, df1 is your dataframe
df1.iloc[:5]
Select first A, B rows of a table, df1 is your dataframe
df1.loc['A','B']

What is the difference between interpolation and imputation?

I just learned that you can handle missing data/ NaN with imputation and interpolation, what i just found is interpolation is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points while imputation is replacing the missing data of the mean of the column. But is there any differences more than that? When is the best practice to use each of them?
Interpolation
Interpolation (linear) is basically a straight line between two given points where data points between these two are missing:
Two red points are known
Blue point is missing
source: wikipedia
Oke nice explanation, but show me with data.
First of all the formula for linear interpolation is the following:
(y1-y0) / (x1-x0)
Let's say we have the three data points from the graph above:
df = pd.DataFrame({'Value':[0, np.NaN, 3]})
Value
0 0.0
1 NaN
2 3.0
As we can see row 1 (blue point) is missing.
So following formula from above:
(3-0) / (2-0) = 1.5
If we interpolate these using the pandas method Series.interpolate:
df['Value'].interpolate()
0 0.0
1 1.5
2 3.0
Name: Value, dtype: float64
For a bigger dataset it would look as follows:
df = pd.DataFrame({'Value':[1, np.NaN, 4, np.NaN, np.NaN,7]})
Value
0 1.0
1 NaN
2 4.0
3 NaN
4 NaN
5 7.0
df['Value'].interpolate()
0 1.0
1 2.5
2 4.0
3 5.0
4 6.0
5 7.0
Name: Value, dtype: float64
Imputation
When we impute the data with the (arithmetic) mean, we follow the following formula:
sum(all points) / n
So for our second dataframe we get:
(1 + 4 + 7) / 3 = 4
So if we impute our dataframe with Series.fillna and Series.mean:
df['Value'].fillna(df['Value'].mean())
0 1.0
1 4.0
2 4.0
3 4.0
4 4.0
5 7.0
Name: Value, dtype: float64
I will answer the second part of your question i.e. when to use what.
We use both techniques depending upon the use case.
Imputation:
If you are given a dataset of patients with a disease (say Pneumonia) and there is a feature called body temperature. So, if there are null values for this feature then you can replace it by average value i.e. Imputation.
Interpolation:
If you are given a dataset of the share price of a company, you know that every Saturday and Sunday are off. So those are missing values. Now, these values can be filled by the average of Friday value and Monday value i.e. Interpolation.
So, you can choose the technique depending upon the use case.

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
list = [55,6,77,75,9,127,13]
finallist = pd.DataFrame(list)
finallist.columns = ['Numbers']
Below gives me the average of rows 0:2 in the Numbers column. So calling out the rows with .iloc[0:2]) works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected the .iloc[0:2].shift(1)) to shift the mean function down 1 row but still apply to 2 total rows, but I got a value of NaN.
Here's a screenshot of my output:
What's happening in your shift(1) approach is that you're actually shifting the index in your data "down" once, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two, which evalutes to NaN, and then you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df.rolling(2, min_periods=1).mean()
This produces the following output:
Numbers Averages
0 55 55.0
1 6 30.5
2 77 41.5
3 75 76.0
4 9 42.0
5 127 68.0
6 13 70.0

Correct syntax for accessing a row in Pandas dataframe [duplicate]

Can someone explain how these two methods of slicing are different?
I've seen the docs,
and I've seen these answers, but I still find myself unable to understand how the three are different. To me, they seem interchangeable in large part, because they are at the lower levels of slicing.
For example, say we want to get the first five rows of a DataFrame. How is it that these two work?
df.loc[:5]
df.iloc[:5]
Can someone present three cases where the distinction in uses are clearer?
Once upon a time, I also wanted to know how these two functions differ from df.ix[:5] but ix has been removed from pandas 1.0, so I don't care anymore.
Label vs. Location
The main distinction between the two methods is:
loc gets rows (and/or columns) with particular labels.
iloc gets rows (and/or columns) at integer locations.
To demonstrate, consider a series s of characters with a non-monotonic integer index:
>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
49 a
48 b
47 c
0 d
1 e
2 f
>>> s.loc[0] # value at index label 0
'd'
>>> s.iloc[0] # value at index location 0
'a'
>>> s.loc[0:1] # rows at index labels between 0 and 1 (inclusive)
0 d
1 e
>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49 a
Here are some of the differences/similarities between s.loc and s.iloc when passed various objects:
<object>
description
s.loc[<object>]
s.iloc[<object>]
0
single item
Value at index label 0 (the string 'd')
Value at index location 0 (the string 'a')
0:1
slice
Two rows (labels 0 and 1)
One row (first row at location 0)
1:47
slice with out-of-bounds end
Zero rows (empty Series)
Five rows (location 1 onwards)
1:47:-1
slice with negative step
three rows (labels 1 back to 47)
Zero rows (empty Series)
[2, 0]
integer list
Two rows with given labels
Two rows with given locations
s > 'e'
Bool series (indicating which values have the property)
One row (containing 'f')
NotImplementedError
(s>'e').values
Bool array
One row (containing 'f')
Same as loc
999
int object not in index
KeyError
IndexError (out of bounds)
-1
int object not in index
KeyError
Returns last value in s
lambda x: x.index[3]
callable applied to series (here returning 3rd item in index)
s.loc[s.index[3]]
s.iloc[s.index[3]]
loc's label-querying capabilities extend well-beyond integer indexes and it's worth highlighting a couple of additional examples.
Here's a Series where the index contains string objects:
>>> s2 = pd.Series(s.index, index=s.values)
>>> s2
a 49
b 48
c 47
d 0
e 1
f 2
Since loc is label-based, it can fetch the first value in the Series using s2.loc['a']. It can also slice with non-integer objects:
>>> s2.loc['c':'e'] # all rows lying between 'c' and 'e' (inclusive)
c 47
d 0
e 1
For DateTime indexes, we don't need to pass the exact date/time to fetch by label. For example:
>>> s3 = pd.Series(list('abcde'), pd.date_range('now', periods=5, freq='M'))
>>> s3
2021-01-31 16:41:31.879768 a
2021-02-28 16:41:31.879768 b
2021-03-31 16:41:31.879768 c
2021-04-30 16:41:31.879768 d
2021-05-31 16:41:31.879768 e
Then to fetch the row(s) for March/April 2021 we only need:
>>> s3.loc['2021-03':'2021-04']
2021-03-31 17:04:30.742316 c
2021-04-30 17:04:30.742316 d
Rows and Columns
loc and iloc work the same way with DataFrames as they do with Series. It's useful to note that both methods can address columns and rows together.
When given a tuple, the first element is used to index the rows and, if it exists, the second element is used to index the columns.
Consider the DataFrame defined below:
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
index=list('abcde'),
columns=['x','y','z', 8, 9])
>>> df
x y z 8 9
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
e 20 21 22 23 24
Then for example:
>>> df.loc['c': , :'z'] # rows 'c' and onwards AND columns up to 'z'
x y z
c 10 11 12
d 15 16 17
e 20 21 22
>>> df.iloc[:, 3] # all rows, but only the column at index location 3
a 3
b 8
c 13
d 18
e 23
Sometimes we want to mix label and positional indexing methods for the rows and columns, somehow combining the capabilities of loc and iloc.
For example, consider the following DataFrame. How best to slice the rows up to and including 'c' and take the first four columns?
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(25).reshape(5, 5),
index=list('abcde'),
columns=['x','y','z', 8, 9])
>>> df
x y z 8 9
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
e 20 21 22 23 24
We can achieve this result using iloc and the help of another method:
>>> df.iloc[:df.index.get_loc('c') + 1, :4]
x y z 8
a 0 1 2 3
b 5 6 7 8
c 10 11 12 13
get_loc() is an index method meaning "get the position of the label in this index". Note that since slicing with iloc is exclusive of its endpoint, we must add 1 to this value if we want row 'c' as well.
iloc works based on integer positioning. So no matter what your row labels are, you can always, e.g., get the first row by doing
df.iloc[0]
or the last five rows by doing
df.iloc[-5:]
You can also use it on the columns. This retrieves the 3rd column:
df.iloc[:, 2] # the : in the first position indicates all rows
You can combine them to get intersections of rows and columns:
df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)
On the other hand, .loc use named indices. Let's set up a data frame with strings as row and column labels:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Then we can get the first row by
df.loc['a'] # equivalent to df.iloc[0]
and the second two rows of the 'date' column by
df.loc['b':, 'date'] # equivalent to df.iloc[1:, 1]
and so on. Now, it's probably worth pointing out that the default row and column indices for a DataFrame are integers from 0 and in this case iloc and loc would work in the same way. This is why your three examples are equivalent. If you had a non-numeric index such as strings or datetimes, df.loc[:5] would raise an error.
Also, you can do column retrieval just by using the data frame's __getitem__:
df['time'] # equivalent to df.loc[:, 'time']
Now suppose you want to mix position and named indexing, that is, indexing using names on rows and positions on columns (to clarify, I mean select from our data frame, rather than creating a data frame with strings in the row index and integers in the column index). This is where .ix comes in:
df.ix[:2, 'time'] # the first two rows of the 'time' column
I think it's also worth mentioning that you can pass boolean vectors to the loc method as well. For example:
b = [True, False, True]
df.loc[b]
Will return the 1st and 3rd rows of df. This is equivalent to df[b] for selection, but it can also be used for assigning via boolean vectors:
df.loc[b, 'name'] = 'Mary', 'John'
In my opinion, the accepted answer is confusing, since it uses a DataFrame with only missing values. I also do not like the term position-based for .iloc and instead, prefer integer location as it is much more descriptive and exactly what .iloc stands for. The key word is INTEGER - .iloc needs INTEGERS.
See my extremely detailed blog series on subset selection for more
.ix is deprecated and ambiguous and should never be used
Because .ix is deprecated we will only focus on the differences between .loc and .iloc.
Before we talk about the differences, it is important to understand that DataFrames have labels that help identify each column and each index. Let's take a look at a sample DataFrame:
df = pd.DataFrame({'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'])
All the words in bold are the labels. The labels, age, color, food, height, score and state are used for the columns. The other labels, Jane, Nick, Aaron, Penelope, Dean, Christina, Cornelia are used for the index.
The primary ways to select particular rows in a DataFrame are with the .loc and .iloc indexers. Each of these indexers can also be used to simultaneously select columns but it is easier to just focus on rows for now. Also, each of the indexers use a set of brackets that immediately follow their name to make their selections.
.loc selects data only by labels
We will first talk about the .loc indexer which only selects data by the index or column labels. In our sample DataFrame, we have provided meaningful names as values for the index. Many DataFrames will not have any meaningful names and will instead, default to just the integers from 0 to n-1, where n is the length of the DataFrame.
There are three different inputs you can use for .loc
A string
A list of strings
Slice notation using strings as the start and stop values
Selecting a single row with .loc with a string
To select a single row of data, place the index label inside of the brackets following .loc.
df.loc['Penelope']
This returns the row of data as a Series
age 4
color white
food Apple
height 80
score 3.3
state AL
Name: Penelope, dtype: object
Selecting multiple rows with .loc with a list of strings
df.loc[['Cornelia', 'Jane', 'Dean']]
This returns a DataFrame with the rows in the order specified in the list:
Selecting multiple rows with .loc with slice notation
Slice notation is defined by a start, stop and step values. When slicing by label, pandas includes the stop value in the return. The following slices from Aaron to Dean, inclusive. Its step size is not explicitly defined but defaulted to 1.
df.loc['Aaron':'Dean']
Complex slices can be taken in the same manner as Python lists.
.iloc selects data only by integer location
Let's now turn to .iloc. Every row and column of data in a DataFrame has an integer location that defines it. This is in addition to the label that is visually displayed in the output. The integer location is simply the number of rows/columns from the top/left beginning at 0.
There are three different inputs you can use for .iloc
An integer
A list of integers
Slice notation using integers as the start and stop values
Selecting a single row with .iloc with an integer
df.iloc[4]
This returns the 5th row (integer location 4) as a Series
age 32
color gray
food Cheese
height 180
score 1.8
state AK
Name: Dean, dtype: object
Selecting multiple rows with .iloc with a list of integers
df.iloc[[2, -2]]
This returns a DataFrame of the third and second to last rows:
Selecting multiple rows with .iloc with slice notation
df.iloc[:5:3]
Simultaneous selection of rows and columns with .loc and .iloc
One excellent ability of both .loc/.iloc is their ability to select both rows and columns simultaneously. In the examples above, all the columns were returned from each selection. We can choose columns with the same types of inputs as we do for rows. We simply need to separate the row and column selection with a comma.
For example, we can select rows Jane, and Dean with just the columns height, score and state like this:
df.loc[['Jane', 'Dean'], 'height':]
This uses a list of labels for the rows and slice notation for the columns
We can naturally do similar operations with .iloc using only integers.
df.iloc[[1,4], 2]
Nick Lamb
Dean Cheese
Name: food, dtype: object
Simultaneous selection with labels and integer location
.ix was used to make selections simultaneously with labels and integer location which was useful but confusing and ambiguous at times and thankfully it has been deprecated. In the event that you need to make a selection with a mix of labels and integer locations, you will have to make both your selections labels or integer locations.
For instance, if we want to select rows Nick and Cornelia along with columns 2 and 4, we could use .loc by converting the integers to labels with the following:
col_names = df.columns[[2, 4]]
df.loc[['Nick', 'Cornelia'], col_names]
Or alternatively, convert the index labels to integers with the get_loc index method.
labels = ['Nick', 'Cornelia']
index_ints = [df.index.get_loc(label) for label in labels]
df.iloc[index_ints, [2, 4]]
Boolean Selection
The .loc indexer can also do boolean selection. For instance, if we are interested in finding all the rows wher age is above 30 and return just the food and score columns we can do the following:
df.loc[df['age'] > 30, ['food', 'score']]
You can replicate this with .iloc but you cannot pass it a boolean series. You must convert the boolean Series into a numpy array like this:
df.iloc[(df['age'] > 30).values, [2, 4]]
Selecting all rows
It is possible to use .loc/.iloc for just column selection. You can select all the rows by using a colon like this:
df.loc[:, 'color':'score':2]
The indexing operator, [], can select rows and columns too but not simultaneously.
Most people are familiar with the primary purpose of the DataFrame indexing operator, which is to select columns. A string selects a single column as a Series and a list of strings selects multiple columns as a DataFrame.
df['food']
Jane Steak
Nick Lamb
Aaron Mango
Penelope Apple
Dean Cheese
Christina Melon
Cornelia Beans
Name: food, dtype: object
Using a list selects multiple columns
df[['food', 'score']]
What people are less familiar with, is that, when slice notation is used, then selection happens by row labels or by integer location. This is very confusing and something that I almost never use but it does work.
df['Penelope':'Christina'] # slice rows by label
df[2:6:2] # slice rows by integer location
The explicitness of .loc/.iloc for selecting rows is highly preferred. The indexing operator alone is unable to select rows and columns simultaneously.
df[3:5, 'color']
TypeError: unhashable type: 'slice'
.loc and .iloc are used for indexing, i.e., to pull out portions of data. In essence, the difference is that .loc allows label-based indexing, while .iloc allows position-based indexing.
If you get confused by .loc and .iloc, keep in mind that .iloc is based on the index (starting with i) position, while .loc is based on the label (starting with l).
.loc
.loc is supposed to be based on the index labels and not the positions, so it is analogous to Python dictionary-based indexing. However, it can accept boolean arrays, slices, and a list of labels (none of which work with a Python dictionary).
iloc
.iloc does the lookup based on index position, i.e., pandas behaves similarly to a Python list. pandas will raise an IndexError if there is no index at that location.
Examples
The following examples are presented to illustrate the differences between .iloc and .loc. Let's consider the following series:
>>> s = pd.Series([11, 9], index=["1990", "1993"], name="Magic Numbers")
>>> s
1990 11
1993 9
Name: Magic Numbers , dtype: int64
.iloc Examples
>>> s.iloc[0]
11
>>> s.iloc[-1]
9
>>> s.iloc[4]
Traceback (most recent call last):
...
IndexError: single positional indexer is out-of-bounds
>>> s.iloc[0:3] # slice
1990 11
1993 9
Name: Magic Numbers , dtype: int64
>>> s.iloc[[0,1]] # list
1990 11
1993 9
Name: Magic Numbers , dtype: int64
.loc Examples
>>> s.loc['1990']
11
>>> s.loc['1970']
Traceback (most recent call last):
...
KeyError: ’the label [1970] is not in the [index]’
>>> mask = s > 9
>>> s.loc[mask]
1990 11
Name: Magic Numbers , dtype: int64
>>> s.loc['1990':] # slice
1990 11
1993 9
Name: Magic Numbers, dtype: int64
Because s has string index values, .loc will fail when
indexing with an integer:
>>> s.loc[0]
Traceback (most recent call last):
...
KeyError: 0
This example will illustrate the difference:
df = pd.DataFrame({'col1': [1,2,3,4,5], 'col2': ["foo", "bar", "baz", "foobar", "foobaz"]})
col1 col2
0 1 foo
1 2 bar
2 3 baz
3 4 foobar
4 5 foobaz
df = df.sort_values('col1', ascending = False)
col1 col2
4 5 foobaz
3 4 foobar
2 3 baz
1 2 bar
0 1 foo
Index based access:
df.iloc[0, 0:2]
col1 5
col2 foobaz
Name: 4, dtype: object
We get the first row of the sorted dataframe. (This is not the row with index 0, but with index 4).
Position based access:
df.loc[0, 'col1':'col2']
col1 1
col2 foo
Name: 0, dtype: object
We get the row with index 0, even when the df is sorted.
DataFrame.loc() : Select rows by index value
DataFrame.iloc() : Select rows by rows number
Example:
Select first 5 rows of a table, df1 is your dataframe
df1.iloc[:5]
Select first A, B rows of a table, df1 is your dataframe
df1.loc['A','B']

How do I get rid of NaNs in MATLAB?

I have files which have many empty cells which appear as NaNs when I use cell2mat, but the problem is when I need to get the average values I cannot work with this as it shows error with NaN. In excel it overlooks NaN values, so how do I do the same in MATLAB?
In addition, I am writing a file using xlswrite:
xlswrite('test.xls',M);
I have data in all rows except 1. How do I write:
M(1,:) = ('time', 'count', 'length', 'width')
In other words, I want M(1,1)='time', M(1,2)='count', and so on. I have data from M(2,1) to M(10,20). How can I do this?
As AP correctly points out, you can use the function isfinite to find and keep only finite values in your matrix. You can also use the function isnan. However, removing values from your matrix can have the unintended consequence of reshaping your matrix into a row or column vector:
>> mat = [1 2 3; 4 NaN 6; 7 8 9] % A sample 3-by-3 matrix
mat =
1 2 3
4 NaN 6
7 8 9
>> mat = mat(~isnan(mat)) % Removing the NaN gives you an 8-by-1 vector
mat =
1
4
7
2
8
3
6
9
Another alternative is to use some functions from the Statistics Toolbox (if you have access to it) that are designed to deal with matrices containing NaN values. Since you mention taking averages, you may want to check out nanmean:
>> mat = [1 2 3; 4 NaN 6; 7 8 9];
>> nanmean(mat)
ans =
4 5 6 % The column means computed by ignoring NaN values
EDIT: To answer your additional question on the use of xlswrite, this sample code should illustrate one way you can write your data:
C = {'time','count','length','width'}; % A cell array of strings
M = rand(10,20); % A 10-by-20 array of random values
xlswrite('test.xls',C); % Writes C to cells A1 through D1
xlswrite('test.xls',M,'A2:T11'); % Writes M to cells A2 through T11
Use ' isfinite ' function to get rid of all NaN and infinities
A=A(isfinite(A))
%create the cell array containing the column headers
columnHeader = {'Column 1', 'Column 2', 'Column 3', 'Column 4', 'Column 5',' '};
%write the column headers first
xlswrite('myFile1.xls', columnHeader );
% write the data directly underneath the column headers
xlswrite('newFile.xls',M,'Sheet1','A2');
Statistics Toolbox has several statistical functions to deal with NaN values. See nanmean, nanmedian, nanstd, nanmin, nanmax, etc.
You can set NaN's to an arbitrary number like so:
mat(isnan(mat))=7 // my lucky number of choice.
May be too late, but...
x = [1 2 3; 4 inf 6; 7 -inf NaN];
x(find(x == inf)) = 0; //for inf
x(find(x == -inf)) = 0; //for -inf
x(find(isnan(x))) = 0; //for NaN

Resources