PySpark, top for DataFrame

PySpark, top for DataFrame - apache-spark

What I want to do is given a DataFrame, take top n elements according to some specified column. The top(self, num) in RDD API is exactly what I want. I wonder if there is equivalent API in DataFrame world ?
My first attempt is the following
def retrieve_top_n(df, n):
# assume we want to get most popular n 'key' in DataFrame
return df.groupBy('key').count().orderBy('count', ascending=False).limit(n).select('key')
However, I've realized that this results in non-deterministic behavior (I don't know the exact reason but I guess limit(n) doesn't guarantee which n to take)

First let's define a function to generate test data:
import numpy as np
def sample_df(num_records):
def data():
np.random.seed(42)
while True:
yield int(np.random.normal(100., 80.))
data_iter = iter(data())
df = sc.parallelize((
(i, next(data_iter)) for i in range(int(num_records))
)).toDF(('index', 'key_col'))
return df
sample_df(1e3).show(n=5)
+-----+-------+
|index|key_col|
+-----+-------+
| 0| 139|
| 1| 88|
| 2| 151|
| 3| 221|
| 4| 81|
+-----+-------+
only showing top 5 rows
Now, let's propose three different ways to calculate TopK:
from pyspark.sql import Window
from pyspark.sql import functions
def top_df_0(df, key_col, K):
"""
Using window functions. Handles ties OK.
"""
window = Window.orderBy(functions.col(key_col).desc())
return (df
.withColumn("rank", functions.rank().over(window))
.filter(functions.col('rank') <= K)
.drop('rank'))
def top_df_1(df, key_col, K):
"""
Using limit(K). Does NOT handle ties appropriately.
"""
return df.orderBy(functions.col(key_col).desc()).limit(K)
def top_df_2(df, key_col, K):
"""
Using limit(k) and then filtering. Handles ties OK."
"""
num_records = df.count()
value_at_k_rank = (df
.orderBy(functions.col(key_col).desc())
.limit(k)
.select(functions.min(key_col).alias('min'))
.first()['min'])
return df.filter(df[key_col] >= value_at_k_rank)
The function called top_df_1 is similar to the one you originally implemented. The reason it gives you non-deterministic behavior is because it cannot handle ties nicely. This may be an OK thing to do if you have lots of data and are only interested in an approximate answer for the sake of performance.
Finally, let's benchmark
For benchmarking use a Spark DF with 4 million entries and define a convenience function:
NUM_RECORDS = 4e6
test_df = sample_df(NUM_RECORDS).cache()
def show(func, df, key_col, K):
func(df, key_col, K).select(
functions.max(key_col),
functions.min(key_col),
functions.count(key_col)
).show()
Let's see the verdict:
%timeit show(top_df_0, test_df, "key_col", K=100)
+------------+------------+--------------+
|max(key_col)|min(key_col)|count(key_col)|
+------------+------------+--------------+
| 502| 420| 108|
+------------+------------+--------------+
1 loops, best of 3: 1.62 s per loop
%timeit show(top_df_1, test_df, "key_col", K=100)
+------------+------------+--------------+
|max(key_col)|min(key_col)|count(key_col)|
+------------+------------+--------------+
| 502| 420| 100|
+------------+------------+--------------+
1 loops, best of 3: 252 ms per loop
%timeit show(top_df_2, test_df, "key_col", K=100)
+------------+------------+--------------+
|max(key_col)|min(key_col)|count(key_col)|
+------------+------------+--------------+
| 502| 420| 108|
+------------+------------+--------------+
1 loops, best of 3: 725 ms per loop
(Note that top_df_0 and top_df_2 have 108 entries in the top 100. This is due to the presence of tied entries for the 100th best. The top_df_1 implementation is ignoring the tied entries.).
The bottom line
If you want an exact answer go with top_df_2 (it is about 2x better than top_df_0). If you want another x2 in performance and are OK with an approximate answer go with top_df_1 .

Options:
1) Use pyspark sql row_number within a window function - relevant SO: spark dataframe grouping, sorting, and selecting top rows for a set of columns
2) convert ordered df to rdd and use the top function there (hint: this doesn't appear to actually maintain ordering from my quick test, but YMMV)

You should try with head() instead of limit()
#sample data
df = sc.parallelize([
['123', 'b'], ['666', 'a'],
['345', 'd'], ['555', 'a'],
['456', 'b'], ['444', 'a'],
['678', 'd'], ['333', 'a'],
['135', 'd'], ['234', 'd'],
['987', 'c'], ['987', 'e']
]).toDF(('col1', 'key_col'))
#select top 'n' 'key_col' values from dataframe 'df'
def retrieve_top_n(df, key, n):
return sqlContext.createDataFrame(df.groupBy(key).count().orderBy('count', ascending=False).head(n)).select(key)
retrieve_top_n(df, 'key_col', 3).show()
Hope this helps!

Related

Spark UDF error AttributeError: 'NoneType' object has no attribute '_jvm'

I found similar question link , but no answer provided how to fix the issue.
I want to make a UDF, that would extract for me words from column. So, I want to create a column named new_column, by applying my UDF to old_column
from pyspark.sql.functions import col, regexp_extract
re_string = 'some|words|I|need|to|match'
def regex_extraction(x,re_string):
return regexp_extract(x,re_string,0)
extracting = udf(lambda row: regex_extraction(row,re_string))
df = df.withColumn("new_column", extracting(col('old_column')))
AttributeError: 'NoneType' object has no attribute '_jvm'
How to fix my function? I have many columns and want to loop through columns list and apply my UDF.

You don't need a UDF. UDF is required when you cannot do something using PySpark, so you need some python functions or libraries. In your case your can have a function which accepts a column and returns a column, but that's it, UDF is not needed.
from pyspark.sql.functions import regexp_extract
df = spark.createDataFrame([('some match',)], ['old_column'])
re_string = 'some|words|I|need|to|match'
def regex_extraction(x, re_string):
return regexp_extract(x, re_string, 0)
df = df.withColumn("new_column", regex_extraction('old_column', re_string))
df.show()
# +----------+----------+
# |old_column|new_column|
# +----------+----------+
# |some match| some|
# +----------+----------+
"Looping" through columns in a list can be implemented this way:
from pyspark.sql.functions import regexp_extract
cols = ['col1', 'col2']
df = spark.createDataFrame([('some match', 'match')], cols)
re_string = 'some|words|I|need|to|match'
def regex_extraction(x, re_string):
return regexp_extract(x, re_string, 0)
df = df.select(
'*',
*[regex_extraction(c, re_string).alias(f'new_{c}') for c in cols]
)
df.show()
# +----------+-----+--------+--------+
# | col1| col2|new_col1|new_col2|
# +----------+-----+--------+--------+
# |some match|match| some| match|
# +----------+-----+--------+--------+

Pyspark: aggregate mode (most frequent) value in a rolling window

I have a dataframe such as follows. I would like to group by device and order by start_time within each group. Then, for each row in the group, get the most frequently occurring station from a window of 3 rows before it (including itself).
columns = ['device', 'start_time', 'station']
data = [("Python", 1, "station_1"), ("Python", 2, "station_2"), ("Python", 3, "station_1"), ("Python", 4, "station_2"), ("Python", 5, "station_2"), ("Python", 6, None)]
test_df = spark.createDataFrame(data).toDF(*columns)
rolling_w = Window.partitionBy('device').orderBy('start_time').rowsBetween(-2, 0)
Desired output:
+------+----------+---------+--------------------+
|device|start_time| station|rolling_mode_station|
+------+----------+---------+--------------------+
|Python| 1|station_1| station_1|
|Python| 2|station_2| station_2|
|Python| 3|station_1| station_1|
|Python| 4|station_2| station_2|
|Python| 5|station_2| station_2|
|Python| 6| null| station_2|
+------+----------+---------+--------------------+
Since Pyspark does not have a mode() function, I know how to get the most frequent value in a static groupby as shown here, but I don't know how to adapt it to a rolling window.

You can use collect_list function to get the stations from last 3 rows using the defined window, then for each resulting array calculate the most frequent element.
To get the most frequent element on the array, you can explode it then group by and count as in linked post your already saw or use some UDF like this:
import pyspark.sql.functions as F
test_df.withColumn(
"rolling_mode_station",
F.collect_list("station").over(rolling_w)
).withColumn(
"rolling_mode_station",
F.udf(lambda x: max(set(x), key=x.count))(F.col("rolling_mode_station"))
).show()
#+------+----------+---------+--------------------+
#|device|start_time| station|rolling_mode_station|
#+------+----------+---------+--------------------+
#|Python| 1|station_1| station_1|
#|Python| 2|station_2| station_1|
#|Python| 3|station_1| station_1|
#|Python| 4|station_2| station_2|
#|Python| 5|station_2| station_2|
#|Python| 6| null| station_2|
#+------+----------+---------+--------------------+

I had a similar requirements and this is how I achieved this.
Step 1: Create a UDF for most common element in an array:
import pyspark.sql.functions as F
#F.udf
def mode(x):
from collections import Counter
return Counter(x).most_common(1)[0][0]
Step 2: Window function
test_df_tmp=test_df.withColumn(
"rolling_mode_station",
F.collect_list("station").over(rolling_w)
)
test_df_tmp.show(truncate=False)
Step3: Call UDF created in step 1
test_df_tmp.select('device','start_time','station', mode('rolling_mode_station')).show()

Syntax for np.where replace column values [duplicate]

Background
I just upgraded my Pandas from 0.11 to 0.13.0rc1. Now, the application is popping out many new warnings. One of them like this:
E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
I want to know what exactly it means? Do I need to change something?
How should I suspend the warning if I insist to use quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE?
The function that gives errors
def _decode_stock_quote(list_of_150_stk_str):
"""decode the webpage and return dataframe"""
from cStringIO import StringIO
str_of_all = "".join(list_of_150_stk_str)
quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
quote_df['TClose'] = quote_df['TPrice']
quote_df['RT'] = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
quote_df['TAmt'] = quote_df['TAmt']/TAMT_SCALE
quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
quote_df['TDate'] = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])
return quote_df
More error messages
E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TAmt'] = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TDate'] = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

The SettingWithCopyWarning was created to flag potentially confusing "chained" assignments, such as the following, which does not always work as expected, particularly when the first selection returns a copy. [see GH5390 and GH5597 for background discussion.]
df[df['A'] > 2]['B'] = new_val # new_val not set in df
The warning offers a suggestion to rewrite as follows:
df.loc[df['A'] > 2, 'B'] = new_val
However, this doesn't fit your usage, which is equivalent to:
df = df[df['A'] > 2]
df['B'] = new_val
While it's clear that you don't care about writes making it back to the original frame (since you are overwriting the reference to it), unfortunately this pattern cannot be differentiated from the first chained assignment example. Hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you'd like to read further. You can safely disable this new warning with the following assignment.
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
Other Resources
pandas User Guide: Indexing and selecting data
Python Data Science Handbook: Data Indexing and Selection
Real Python: SettingWithCopyWarning in Pandas: Views vs Copies
Dataquest: SettingwithCopyWarning: How to Fix This Warning in Pandas
Towards Data Science: Explaining the SettingWithCopyWarning in pandas

How to deal with SettingWithCopyWarning in Pandas?
This post is meant for readers who,
Would like to understand what this warning means
Would like to understand different ways of suppressing this warning
Would like to understand how to improve their code and follow good practices to avoid this warning in the future.
Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df
A B C D E
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
What is the SettingWithCopyWarning?
To know how to deal with this warning, it is important to understand what it means and why it is raised in the first place.
When filtering DataFrames, it is possible slice/index a frame to return either a view, or a copy, depending on the internal layout and various implementation details. A "view" is, as the term suggests, a view into the original data, so modifying the view may modify the original object. On the other hand, a "copy" is a replication of data from the original, and modifying the copy has no effect on the original.
As mentioned by other answers, the SettingWithCopyWarning was created to flag "chained assignment" operations. Consider df in the setup above. Suppose you would like to select all values in column "B" where values in column "A" is > 5. Pandas allows you to do this in different ways, some more correct than others. For example,
df[df.A > 5]['B']
1 3
2 6
Name: B, dtype: int64
And,
df.loc[df.A > 5, 'B']
1 3
2 6
Name: B, dtype: int64
These return the same result, so if you are only reading these values, it makes no difference. So, what is the issue? The problem with chained assignment, is that it is generally difficult to predict whether a view or a copy is returned, so this largely becomes an issue when you are attempting to assign values back. To build on the earlier example, consider how this code is executed by the interpreter:
df.loc[df.A > 5, 'B'] = 4
# becomes
df.__setitem__((df.A > 5, 'B'), 4)
With a single __setitem__ call to df. OTOH, consider this code:
df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)
Now, depending on whether __getitem__ returned a view or a copy, the __setitem__ operation may not work.
In general, you should use loc for label-based assignment, and iloc for integer/positional based assignment, as the spec guarantees that they always operate on the original. Additionally, for setting a single cell, you should use at and iat.
More can be found in the documentation.
Note
All boolean indexing operations done with loc can also be done with iloc. The only difference is that iloc expects either
integers/positions for index or a numpy array of boolean values, and
integer/position indexes for the columns.
For example,
df.loc[df.A > 5, 'B'] = 4
Can be written nas
df.iloc[(df.A > 5).values, 1] = 4
And,
df.loc[1, 'A'] = 100
Can be written as
df.iloc[1, 0] = 100
And so on.
Just tell me how to suppress the warning!
Consider a simple operation on the "A" column of df. Selecting "A" and dividing by 2 will raise the warning, but the operation will work.
df2 = df[['A']]
df2['A'] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
df2
A
0 2.5
1 4.5
2 3.5
There are a couple ways of directly silencing this warning:
(recommended) Use loc to slice subsets:
df2 = df.loc[:, ['A']]
df2['A'] /= 2 # Does not raise
Change pd.options.mode.chained_assignment
Can be set to None, "warn", or "raise". "warn" is the default. None will suppress the warning entirely, and "raise" will throw a SettingWithCopyError, preventing the operation from going through.
pd.options.mode.chained_assignment = None
df2['A'] /= 2
Make a deepcopy
df2 = df[['A']].copy(deep=True)
df2['A'] /= 2
#Peter Cotton in the comments, came up with a nice way of non-intrusively changing the mode (modified from this gist) using a context manager, to set the mode only as long as it is required, and the reset it back to the original state when finished.
class ChainedAssignent:
def __init__(self, chained=None):
acceptable = [None, 'warn', 'raise']
assert chained in acceptable, "chained must be in " + str(acceptable)
self.swcw = chained
def __enter__(self):
self.saved_swcw = pd.options.mode.chained_assignment
pd.options.mode.chained_assignment = self.swcw
return self
def __exit__(self, *args):
pd.options.mode.chained_assignment = self.saved_swcw
The usage is as follows:
# Some code here
with ChainedAssignent():
df2['A'] /= 2
# More code follows
Or, to raise the exception
with ChainedAssignent(chained='raise'):
df2['A'] /= 2
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The "XY Problem": What am I doing wrong?
A lot of the time, users attempt to look for ways of suppressing this exception without fully understanding why it was raised in the first place. This is a good example of an XY problem, where users attempt to solve a problem "Y" that is actually a symptom of a deeper rooted problem "X". Questions will be raised based on common problems that encounter this warning, and solutions will then be presented.
Question 1
I have a DataFrame
df
A B C D E
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
I want to assign values in col "A" > 5 to 1000. My expected output is
A B C D E
0 5 0 3 3 7
1 1000 3 5 2 4
2 1000 6 8 8 1
Wrong way to do this:
df.A[df.A > 5] = 1000 # works, because df.A returns a view
df[df.A > 5]['A'] = 1000 # does not work
df.loc[df.A > 5]['A'] = 1000 # does not work
Right way using loc:
df.loc[df.A > 5, 'A'] = 1000
Question 21
I am trying to set the value in cell (1, 'D') to 12345. My expected output is
A B C D E
0 5 0 3 3 7
1 9 3 5 12345 4
2 7 6 8 8 1
I have tried different ways of accessing this cell, such as
df['D'][1]. What is the best way to do this?
1. This question isn't specifically related to the warning, but
it is good to understand how to do this particular operation correctly
so as to avoid situations where the warning could potentially arise in
future.
You can use any of the following methods to do this.
df.loc[1, 'D'] = 12345
df.iloc[1, 3] = 12345
df.at[1, 'D'] = 12345
df.iat[1, 3] = 12345
Question 3
I am trying to subset values based on some condition. I have a
DataFrame
A B C D E
1 9 3 5 2 4
2 7 6 8 8 1
I would like to assign values in "D" to 123 such that "C" == 5. I
tried
df2.loc[df2.C == 5, 'D'] = 123
Which seems fine but I am still getting the
SettingWithCopyWarning! How do I fix this?
This is actually probably because of code higher up in your pipeline. Did you create df2 from something larger, like
df2 = df[df.A > 5]
? In this case, boolean indexing will return a view, so df2 will reference the original. What you'd need to do is assign df2 to a copy:
df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]
Question 4
I'm trying to drop column "C" in-place from
A B C D E
1 9 3 5 2 4
2 7 6 8 8 1
But using
df2.drop('C', axis=1, inplace=True)
Throws SettingWithCopyWarning. Why is this happening?
This is because df2 must have been created as a view from some other slicing operation, such as
df2 = df[df.A > 5]
The solution here is to either make a copy() of df, or use loc, as before.

In general the point of the SettingWithCopyWarning is to show users (and especially new users) that they may be operating on a copy and not the original as they think. There are false positives (IOW if you know what you are doing it could be ok). One possibility is simply to turn off the (by default warn) warning as #Garrett suggest.
Here is another option:
In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [2]: dfa = df.ix[:, [1, 0]]
In [3]: dfa.is_copy
Out[3]: True
In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
#!/usr/local/bin/python
You can set the is_copy flag to False, which will effectively turn off the check, for that object:
In [5]: dfa.is_copy = False
In [6]: dfa['A'] /= 2
If you explicitly copy then no further warning will happen:
In [7]: dfa = df.ix[:, [1, 0]].copy()
In [8]: dfa['A'] /= 2
The code the OP is showing above, while legitimate, and probably something I do as well, is technically a case for this warning, and not a false positive. Another way to not have the warning would be to do the selection operation via reindex, e.g.
quote_df = quote_df.reindex(columns=['STK', ...])
Or,
quote_df = quote_df.reindex(['STK', ...], axis=1) # v.0.21

Here I answer the question directly. How can we deal with it?
Make a .copy(deep=False) after you slice. See pandas.DataFrame.copy.
Wait, doesn't a slice return a copy? After all, this is what the warning message is attempting to say? Read the long answer:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3]})
This gives a warning:
df0 = df[df.x>2]
df0['foo'] = 'bar'
This does not:
df1 = df[df.x>2].copy(deep=False)
df1['foo'] = 'bar'
Both df0 and df1 are DataFrame objects, but something about them is different that enables pandas to print the warning. Let's find out what it is.
import inspect
slice= df[df.x>2]
slice_copy = df[df.x>2].copy(deep=False)
inspect.getmembers(slice)
inspect.getmembers(slice_copy)
Using your diff tool of choice, you will see that beyond a couple of addresses, the only material difference is this:
| | slice | slice_copy |
| _is_copy | weakref | None |
The method that decides whether to warn is DataFrame._check_setitem_copy which checks _is_copy. So here you go. Make a copy so that your DataFrame is not _is_copy.
The warning is suggesting to use .loc, but if you use .loc on a frame that _is_copy, you will still get the same warning. Misleading? Yes. Annoying? You bet. Helpful? Potentially, when chained assignment is used. But it cannot correctly detect chain assignment and prints the warning indiscriminately.

Pandas dataframe copy warning
When you go and do something like this:
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
pandas.ix in this case returns a new, stand alone dataframe.
Any values you decide to change in this dataframe, will not change the original dataframe.
This is what pandas tries to warn you about.
Why .ix is a bad idea
The .ix object tries to do more than one thing, and for anyone who has read anything about clean code, this is a strong smell.
Given this dataframe:
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})
Two behaviors:
dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2
Behavior one: dfcopy is now a stand alone dataframe. Changing it will not change df
df.ix[0, "a"] = 3
Behavior two: This changes the original dataframe.
Use .loc instead
The pandas developers recognized that the .ix object was quite smelly[speculatively] and thus created two new objects which helps in the accession and assignment of data. (The other being .iloc)
.loc is faster, because it does not try to create a copy of the data.
.loc is meant to modify your existing dataframe inplace, which is more memory efficient.
.loc is predictable, it has one behavior.
The solution
What you are doing in your code example is loading a big file with lots of columns, then modifying it to be smaller.
The pd.read_csv function can help you out with a lot of this and also make the loading of the file a lot faster.
So instead of doing this
quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
Do this
columns = ['STK', 'TPrice', 'TPCLOSE', 'TOpen', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
df = pd.read_csv(StringIO(str_of_all), sep=',', usecols=[0,3,2,1,4,5,8,9,30,31])
df.columns = columns
This will only read the columns you are interested in, and name them properly. No need for using the evil .ix object to do magical stuff.

This topic is really confusing with Pandas. Luckily, it has a relatively simple solution.
The problem is that it is not always clear whether data filtering operations (e.g. loc) return a copy or a view of the DataFrame. Further use of such filtered DataFrame could therefore be confusing.
The simple solution is (unless you need to work with very large sets of data):
Whenever you need to update any values, always make sure that you explicitly copy the DataFrame before the assignment.
df # Some DataFrame
df = df.loc[:, 0:2] # Some filtering (unsure whether a view or copy is returned)
df = df.copy() # Ensuring a copy is made
df[df["Name"] == "John"] = "Johny" # Assignment can be done now (no warning)

I had been getting this issue with .apply() when assigning a new dataframe from a pre-existing dataframe on which I've used the .query() method. For instance:
prop_df = df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)
Would return this error. The fix that seems to resolve the error in this case is by changing this to:
prop_df = df.copy(deep=True)
prop_df = prop_df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)
However, this is not efficient especially when using large dataframes, due to having to make a new copy.
If you're using the .apply() method in generating a new column and its values, a fix that resolves the error and is more efficient is by adding .reset_index(drop=True):
prop_df = df.query('column == "value"').reset_index(drop=True)
prop_df['new_column'] = prop_df.apply(function, axis=1)

Just simply:
import pandas as pd
# ...
pd.set_option('mode.chained_assignment', None)

To remove any doubt, my solution was to make a deep copy of the slice instead of a regular copy.
This may not be applicable depending on your context (Memory constraints / size of the slice, potential for performance degradation - especially if the copy occurs in a loop like it did for me, etc...)
To be clear, here is the warning I received:
/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Illustration
I had doubts that the warning was thrown because of a column I was dropping on a copy of the slice. While not technically trying to set a value in the copy of the slice, that was still a modification of the copy of the slice.
Below are the (simplified) steps I have taken to confirm the suspicion, I hope it will help those of us who are trying to understand the warning.
Example 1: dropping a column on the original affects the copy
We knew that already but this is a healthy reminder. This is NOT what the warning is about.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> df2 = df1
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
B
0 121
1 122
2 123
It is possible to avoid changes made on df1 to affect df2. Note: you can avoid importing copy.deepcopy by doing df.copy() instead.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
A B
0 111 121
1 112 122
2 113 123
Example 2: dropping a column on the copy may affect the original
This actually illustrates the warning.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> df2 = df1
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1
B
0 121
1 122
2 123
It is possible to avoid changes made on df2 to affect df1
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A B
0 111 121
1 112 122
2 113 123
>> df2.drop('A', axis=1, inplace=True)
>> df1
A B
0 111 121
1 112 122
2 113 123

This should work:
quote_df.loc[:,'TVol'] = quote_df['TVol']/TVOL_SCALE

Some may want to simply suppress the warning:
class SupressSettingWithCopyWarning:
def __enter__(self):
pd.options.mode.chained_assignment = None
def __exit__(self, *args):
pd.options.mode.chained_assignment = 'warn'
with SupressSettingWithCopyWarning():
#code that produces warning

As this question is already fully explained and discussed in existing answers, I will just provide a neat pandas approach to the context manager using pandas.option_context (links to documentation and example) - there is absolutely isn't any need to create a custom class with all the dunder methods and other bells and whistles.
First the context manager code itself:
from contextlib import contextmanager
#contextmanager
def SuppressPandasWarning():
with pd.option_context("mode.chained_assignment", None):
yield
Then an example:
import pandas as pd
from string import ascii_letters
a = pd.DataFrame({"A": list(ascii_letters[0:4]), "B": range(0,4)})
mask = a["A"].isin(["c", "d"])
# Even shallow copy below is enough to not raise the warning, but why is a mystery to me.
b = a.loc[mask] # .copy(deep=False)
# Raises the `SettingWithCopyWarning`
b["B"] = b["B"] * 2
# Does not!
with SuppressPandasWarning():
b["B"] = b["B"] * 2
It is worth noticing is that both approaches do not modify a, which is a bit surprising to me, and even a shallow df copy with .copy(deep=False) would prevent this warning to be raised (as far as I understand, shallow copy should at least modify a as well, but it doesn't. pandas magic.).

This might apply to NumPy only, which means you might need to import it, but the data I used for my examples NumPy was not essential with the calculations, but you can simply stop this settingwithcopy warning message, by using this one line of code below:
np.warnings.filterwarnings('ignore')

Follow-up beginner question / remark
Maybe a clarification for other beginners like me (I come from R which seems to work a bit differently under the hood). The following harmless-looking and functional code kept producing the SettingWithCopy warning, and I couldn't figure out why. I had both read and understood the issued with "chained indexing", but my code doesn't contain any:
def plot(pdb, df, title, **kw):
df['target'] = (df['ogg'] + df['ugg']) / 2
# ...
But then, later, much too late, I looked at where the plot() function is called:
df = data[data['anz_emw'] > 0]
pixbuf = plot(pdb, df, title)
So "df" isn't a data frame, but an object that somehow remembers that it was created by indexing a data frame (so is that a view?) which would make the line in plot(),
df['target'] = ...
equivalent to
data[data['anz_emw'] > 0]['target'] = ...
which is a chained indexing.
Anyway,
def plot(pdb, df, title, **kw):
df.loc[:,'target'] = (df['ogg'] + df['ugg']) / 2
fixed it.

You could avoid the whole problem like this, I believe:
return (
pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
.ix[:,[0,3,2,1,4,5,8,9,30,31]]
.assign(
TClose=lambda df: df['TPrice'],
RT=lambda df: 100 * (df['TPrice']/quote_df['TPCLOSE'] - 1),
TVol=lambda df: df['TVol']/TVOL_SCALE,
TAmt=lambda df: df['TAmt']/TAMT_SCALE,
STK_ID=lambda df: df['STK'].str.slice(13,19),
STK_Name=lambda df: df['STK'].str.slice(21,30)#.decode('gb2312'),
TDate=lambda df: df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10]),
)
)
Using Assign. From the documentation: Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
See Tom Augspurger's article on method chaining in pandas: Modern Pandas (Part 2): Method Chaining

I was facing the same warning, while I executed this part of my code:
def scaler(self, numericals):
scaler = MinMaxScaler()
self.data.loc[:, numericals[0]] = scaler.fit_transform(self.data.loc[:, numericals[0]])
self.data.loc[:, numericals[1]] = scaler.fit_transform(self.data.loc[:, numericals[1]])
where scaler is a MinMaxScaler and numericals[0] contains names of three of my numerical columns.
The warning was removed as I changed the code to:
def scaler(self, numericals):
scaler = MinMaxScaler()
self.data.loc[:][numericals[0]] = scaler.fit_transform(self.data.loc[:][numericals[0]])
self.data.loc[:][numericals[1]] = scaler.fit_transform(self.data.loc[:][numericals[1]])
So, just change [:, ~] to [:][~].

If you have assigned the slice to a variable and want to set using the variable as in the following:
df2 = df[df['A'] > 2]
df2['B'] = value
And you do not want to use Jeff's solution, because your condition computing df2 is to long or for some other reason, then you can use the following:
df.loc[df2.index.tolist(), 'B'] = value
df2.index.tolist() returns the indices from all entries in df2, which will then be used to set column B in the original dataframe.

In my case, I would create a new column based on the index, but I got the same warning as you:
df_temp["Quarter"] = df_temp.index.quarter
I use insert() instead of direct assignment, and it works for me:
df_temp.insert(loc=0, column='Quarter', value=df_temp.index.quarter)

For me this issue occurred in a following simplified example. And I was also able to solve it (hopefully with a correct solution):
Old code with warning:
def update_old_dataframe(old_dataframe, new_dataframe):
for new_index, new_row in new_dataframe.iterrorws():
old_dataframe.loc[new_index] = update_row(old_dataframe.loc[new_index], new_row)
def update_row(old_row, new_row):
for field in [list_of_columns]:
# line with warning because of chain indexing old_dataframe[new_index][field]
old_row[field] = new_row[field]
return old_row
This printed the warning for the line old_row[field] = new_row[field]
Since the rows in update_row method are actually type Series, I replaced the line with:
old_row.at[field] = new_row.at[field]
I.e., a method for accessing/lookups for a Series. Even though both works just fine and the result is same, this way I don't have to disable the warnings (=keep them for other chain indexing issues somewhere else).

Just create a copy of your dataframe(s) using the .copy() method before the warning appears, to remove all of your warnings.
This happens, because we do not want to make changes to the original quote_df. In other words, we do not want to play with the reference of the object of the quote_df which we have created for quote_df.
quote_df = quote_df.copy()

How to properly make a pandas mask [duplicate]

Background
I just upgraded my Pandas from 0.11 to 0.13.0rc1. Now, the application is popping out many new warnings. One of them like this:
E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
I want to know what exactly it means? Do I need to change something?
How should I suspend the warning if I insist to use quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE?
The function that gives errors
def _decode_stock_quote(list_of_150_stk_str):
"""decode the webpage and return dataframe"""
from cStringIO import StringIO
str_of_all = "".join(list_of_150_stk_str)
quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
quote_df['TClose'] = quote_df['TPrice']
quote_df['RT'] = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
quote_df['TAmt'] = quote_df['TAmt']/TAMT_SCALE
quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
quote_df['TDate'] = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])
return quote_df
More error messages
E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TAmt'] = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TDate'] = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

The SettingWithCopyWarning was created to flag potentially confusing "chained" assignments, such as the following, which does not always work as expected, particularly when the first selection returns a copy. [see GH5390 and GH5597 for background discussion.]
df[df['A'] > 2]['B'] = new_val # new_val not set in df
The warning offers a suggestion to rewrite as follows:
df.loc[df['A'] > 2, 'B'] = new_val
However, this doesn't fit your usage, which is equivalent to:
df = df[df['A'] > 2]
df['B'] = new_val
While it's clear that you don't care about writes making it back to the original frame (since you are overwriting the reference to it), unfortunately this pattern cannot be differentiated from the first chained assignment example. Hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you'd like to read further. You can safely disable this new warning with the following assignment.
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
Other Resources
pandas User Guide: Indexing and selecting data
Python Data Science Handbook: Data Indexing and Selection
Real Python: SettingWithCopyWarning in Pandas: Views vs Copies
Dataquest: SettingwithCopyWarning: How to Fix This Warning in Pandas
Towards Data Science: Explaining the SettingWithCopyWarning in pandas

How to deal with SettingWithCopyWarning in Pandas?
This post is meant for readers who,
Would like to understand what this warning means
Would like to understand different ways of suppressing this warning
Would like to understand how to improve their code and follow good practices to avoid this warning in the future.
Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df
A B C D E
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
What is the SettingWithCopyWarning?
To know how to deal with this warning, it is important to understand what it means and why it is raised in the first place.
When filtering DataFrames, it is possible slice/index a frame to return either a view, or a copy, depending on the internal layout and various implementation details. A "view" is, as the term suggests, a view into the original data, so modifying the view may modify the original object. On the other hand, a "copy" is a replication of data from the original, and modifying the copy has no effect on the original.
As mentioned by other answers, the SettingWithCopyWarning was created to flag "chained assignment" operations. Consider df in the setup above. Suppose you would like to select all values in column "B" where values in column "A" is > 5. Pandas allows you to do this in different ways, some more correct than others. For example,
df[df.A > 5]['B']
1 3
2 6
Name: B, dtype: int64
And,
df.loc[df.A > 5, 'B']
1 3
2 6
Name: B, dtype: int64
These return the same result, so if you are only reading these values, it makes no difference. So, what is the issue? The problem with chained assignment, is that it is generally difficult to predict whether a view or a copy is returned, so this largely becomes an issue when you are attempting to assign values back. To build on the earlier example, consider how this code is executed by the interpreter:
df.loc[df.A > 5, 'B'] = 4
# becomes
df.__setitem__((df.A > 5, 'B'), 4)
With a single __setitem__ call to df. OTOH, consider this code:
df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)
Now, depending on whether __getitem__ returned a view or a copy, the __setitem__ operation may not work.
In general, you should use loc for label-based assignment, and iloc for integer/positional based assignment, as the spec guarantees that they always operate on the original. Additionally, for setting a single cell, you should use at and iat.
More can be found in the documentation.
Note
All boolean indexing operations done with loc can also be done with iloc. The only difference is that iloc expects either
integers/positions for index or a numpy array of boolean values, and
integer/position indexes for the columns.
For example,
df.loc[df.A > 5, 'B'] = 4
Can be written nas
df.iloc[(df.A > 5).values, 1] = 4
And,
df.loc[1, 'A'] = 100
Can be written as
df.iloc[1, 0] = 100
And so on.
Just tell me how to suppress the warning!
Consider a simple operation on the "A" column of df. Selecting "A" and dividing by 2 will raise the warning, but the operation will work.
df2 = df[['A']]
df2['A'] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
df2
A
0 2.5
1 4.5
2 3.5
There are a couple ways of directly silencing this warning:
(recommended) Use loc to slice subsets:
df2 = df.loc[:, ['A']]
df2['A'] /= 2 # Does not raise
Change pd.options.mode.chained_assignment
Can be set to None, "warn", or "raise". "warn" is the default. None will suppress the warning entirely, and "raise" will throw a SettingWithCopyError, preventing the operation from going through.
pd.options.mode.chained_assignment = None
df2['A'] /= 2
Make a deepcopy
df2 = df[['A']].copy(deep=True)
df2['A'] /= 2
#Peter Cotton in the comments, came up with a nice way of non-intrusively changing the mode (modified from this gist) using a context manager, to set the mode only as long as it is required, and the reset it back to the original state when finished.
class ChainedAssignent:
def __init__(self, chained=None):
acceptable = [None, 'warn', 'raise']
assert chained in acceptable, "chained must be in " + str(acceptable)
self.swcw = chained
def __enter__(self):
self.saved_swcw = pd.options.mode.chained_assignment
pd.options.mode.chained_assignment = self.swcw
return self
def __exit__(self, *args):
pd.options.mode.chained_assignment = self.saved_swcw
The usage is as follows:
# Some code here
with ChainedAssignent():
df2['A'] /= 2
# More code follows
Or, to raise the exception
with ChainedAssignent(chained='raise'):
df2['A'] /= 2
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The "XY Problem": What am I doing wrong?
A lot of the time, users attempt to look for ways of suppressing this exception without fully understanding why it was raised in the first place. This is a good example of an XY problem, where users attempt to solve a problem "Y" that is actually a symptom of a deeper rooted problem "X". Questions will be raised based on common problems that encounter this warning, and solutions will then be presented.
Question 1
I have a DataFrame
df
A B C D E
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
I want to assign values in col "A" > 5 to 1000. My expected output is
A B C D E
0 5 0 3 3 7
1 1000 3 5 2 4
2 1000 6 8 8 1
Wrong way to do this:
df.A[df.A > 5] = 1000 # works, because df.A returns a view
df[df.A > 5]['A'] = 1000 # does not work
df.loc[df.A > 5]['A'] = 1000 # does not work
Right way using loc:
df.loc[df.A > 5, 'A'] = 1000
Question 21
I am trying to set the value in cell (1, 'D') to 12345. My expected output is
A B C D E
0 5 0 3 3 7
1 9 3 5 12345 4
2 7 6 8 8 1
I have tried different ways of accessing this cell, such as
df['D'][1]. What is the best way to do this?
1. This question isn't specifically related to the warning, but
it is good to understand how to do this particular operation correctly
so as to avoid situations where the warning could potentially arise in
future.
You can use any of the following methods to do this.
df.loc[1, 'D'] = 12345
df.iloc[1, 3] = 12345
df.at[1, 'D'] = 12345
df.iat[1, 3] = 12345
Question 3
I am trying to subset values based on some condition. I have a
DataFrame
A B C D E
1 9 3 5 2 4
2 7 6 8 8 1
I would like to assign values in "D" to 123 such that "C" == 5. I
tried
df2.loc[df2.C == 5, 'D'] = 123
Which seems fine but I am still getting the
SettingWithCopyWarning! How do I fix this?
This is actually probably because of code higher up in your pipeline. Did you create df2 from something larger, like
df2 = df[df.A > 5]
? In this case, boolean indexing will return a view, so df2 will reference the original. What you'd need to do is assign df2 to a copy:
df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]
Question 4
I'm trying to drop column "C" in-place from
A B C D E
1 9 3 5 2 4
2 7 6 8 8 1
But using
df2.drop('C', axis=1, inplace=True)
Throws SettingWithCopyWarning. Why is this happening?
This is because df2 must have been created as a view from some other slicing operation, such as
df2 = df[df.A > 5]
The solution here is to either make a copy() of df, or use loc, as before.

In general the point of the SettingWithCopyWarning is to show users (and especially new users) that they may be operating on a copy and not the original as they think. There are false positives (IOW if you know what you are doing it could be ok). One possibility is simply to turn off the (by default warn) warning as #Garrett suggest.
Here is another option:
In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [2]: dfa = df.ix[:, [1, 0]]
In [3]: dfa.is_copy
Out[3]: True
In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
#!/usr/local/bin/python
You can set the is_copy flag to False, which will effectively turn off the check, for that object:
In [5]: dfa.is_copy = False
In [6]: dfa['A'] /= 2
If you explicitly copy then no further warning will happen:
In [7]: dfa = df.ix[:, [1, 0]].copy()
In [8]: dfa['A'] /= 2
The code the OP is showing above, while legitimate, and probably something I do as well, is technically a case for this warning, and not a false positive. Another way to not have the warning would be to do the selection operation via reindex, e.g.
quote_df = quote_df.reindex(columns=['STK', ...])
Or,
quote_df = quote_df.reindex(['STK', ...], axis=1) # v.0.21

Here I answer the question directly. How can we deal with it?
Make a .copy(deep=False) after you slice. See pandas.DataFrame.copy.
Wait, doesn't a slice return a copy? After all, this is what the warning message is attempting to say? Read the long answer:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3]})
This gives a warning:
df0 = df[df.x>2]
df0['foo'] = 'bar'
This does not:
df1 = df[df.x>2].copy(deep=False)
df1['foo'] = 'bar'
Both df0 and df1 are DataFrame objects, but something about them is different that enables pandas to print the warning. Let's find out what it is.
import inspect
slice= df[df.x>2]
slice_copy = df[df.x>2].copy(deep=False)
inspect.getmembers(slice)
inspect.getmembers(slice_copy)
Using your diff tool of choice, you will see that beyond a couple of addresses, the only material difference is this:
| | slice | slice_copy |
| _is_copy | weakref | None |
The method that decides whether to warn is DataFrame._check_setitem_copy which checks _is_copy. So here you go. Make a copy so that your DataFrame is not _is_copy.
The warning is suggesting to use .loc, but if you use .loc on a frame that _is_copy, you will still get the same warning. Misleading? Yes. Annoying? You bet. Helpful? Potentially, when chained assignment is used. But it cannot correctly detect chain assignment and prints the warning indiscriminately.

Pandas dataframe copy warning
When you go and do something like this:
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
pandas.ix in this case returns a new, stand alone dataframe.
Any values you decide to change in this dataframe, will not change the original dataframe.
This is what pandas tries to warn you about.
Why .ix is a bad idea
The .ix object tries to do more than one thing, and for anyone who has read anything about clean code, this is a strong smell.
Given this dataframe:
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})
Two behaviors:
dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2
Behavior one: dfcopy is now a stand alone dataframe. Changing it will not change df
df.ix[0, "a"] = 3
Behavior two: This changes the original dataframe.
Use .loc instead
The pandas developers recognized that the .ix object was quite smelly[speculatively] and thus created two new objects which helps in the accession and assignment of data. (The other being .iloc)
.loc is faster, because it does not try to create a copy of the data.
.loc is meant to modify your existing dataframe inplace, which is more memory efficient.
.loc is predictable, it has one behavior.
The solution
What you are doing in your code example is loading a big file with lots of columns, then modifying it to be smaller.
The pd.read_csv function can help you out with a lot of this and also make the loading of the file a lot faster.
So instead of doing this
quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
Do this
columns = ['STK', 'TPrice', 'TPCLOSE', 'TOpen', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
df = pd.read_csv(StringIO(str_of_all), sep=',', usecols=[0,3,2,1,4,5,8,9,30,31])
df.columns = columns
This will only read the columns you are interested in, and name them properly. No need for using the evil .ix object to do magical stuff.

This topic is really confusing with Pandas. Luckily, it has a relatively simple solution.
The problem is that it is not always clear whether data filtering operations (e.g. loc) return a copy or a view of the DataFrame. Further use of such filtered DataFrame could therefore be confusing.
The simple solution is (unless you need to work with very large sets of data):
Whenever you need to update any values, always make sure that you explicitly copy the DataFrame before the assignment.
df # Some DataFrame
df = df.loc[:, 0:2] # Some filtering (unsure whether a view or copy is returned)
df = df.copy() # Ensuring a copy is made
df[df["Name"] == "John"] = "Johny" # Assignment can be done now (no warning)

I had been getting this issue with .apply() when assigning a new dataframe from a pre-existing dataframe on which I've used the .query() method. For instance:
prop_df = df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)
Would return this error. The fix that seems to resolve the error in this case is by changing this to:
prop_df = df.copy(deep=True)
prop_df = prop_df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)
However, this is not efficient especially when using large dataframes, due to having to make a new copy.
If you're using the .apply() method in generating a new column and its values, a fix that resolves the error and is more efficient is by adding .reset_index(drop=True):
prop_df = df.query('column == "value"').reset_index(drop=True)
prop_df['new_column'] = prop_df.apply(function, axis=1)

Just simply:
import pandas as pd
# ...
pd.set_option('mode.chained_assignment', None)

To remove any doubt, my solution was to make a deep copy of the slice instead of a regular copy.
This may not be applicable depending on your context (Memory constraints / size of the slice, potential for performance degradation - especially if the copy occurs in a loop like it did for me, etc...)
To be clear, here is the warning I received:
/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Illustration
I had doubts that the warning was thrown because of a column I was dropping on a copy of the slice. While not technically trying to set a value in the copy of the slice, that was still a modification of the copy of the slice.
Below are the (simplified) steps I have taken to confirm the suspicion, I hope it will help those of us who are trying to understand the warning.
Example 1: dropping a column on the original affects the copy
We knew that already but this is a healthy reminder. This is NOT what the warning is about.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> df2 = df1
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
B
0 121
1 122
2 123
It is possible to avoid changes made on df1 to affect df2. Note: you can avoid importing copy.deepcopy by doing df.copy() instead.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
A B
0 111 121
1 112 122
2 113 123
Example 2: dropping a column on the copy may affect the original
This actually illustrates the warning.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> df2 = df1
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1
B
0 121
1 122
2 123
It is possible to avoid changes made on df2 to affect df1
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A B
0 111 121
1 112 122
2 113 123
>> df2.drop('A', axis=1, inplace=True)
>> df1
A B
0 111 121
1 112 122
2 113 123

This should work:
quote_df.loc[:,'TVol'] = quote_df['TVol']/TVOL_SCALE

Some may want to simply suppress the warning:
class SupressSettingWithCopyWarning:
def __enter__(self):
pd.options.mode.chained_assignment = None
def __exit__(self, *args):
pd.options.mode.chained_assignment = 'warn'
with SupressSettingWithCopyWarning():
#code that produces warning

As this question is already fully explained and discussed in existing answers, I will just provide a neat pandas approach to the context manager using pandas.option_context (links to documentation and example) - there is absolutely isn't any need to create a custom class with all the dunder methods and other bells and whistles.
First the context manager code itself:
from contextlib import contextmanager
#contextmanager
def SuppressPandasWarning():
with pd.option_context("mode.chained_assignment", None):
yield
Then an example:
import pandas as pd
from string import ascii_letters
a = pd.DataFrame({"A": list(ascii_letters[0:4]), "B": range(0,4)})
mask = a["A"].isin(["c", "d"])
# Even shallow copy below is enough to not raise the warning, but why is a mystery to me.
b = a.loc[mask] # .copy(deep=False)
# Raises the `SettingWithCopyWarning`
b["B"] = b["B"] * 2
# Does not!
with SuppressPandasWarning():
b["B"] = b["B"] * 2
It is worth noticing is that both approaches do not modify a, which is a bit surprising to me, and even a shallow df copy with .copy(deep=False) would prevent this warning to be raised (as far as I understand, shallow copy should at least modify a as well, but it doesn't. pandas magic.).

This might apply to NumPy only, which means you might need to import it, but the data I used for my examples NumPy was not essential with the calculations, but you can simply stop this settingwithcopy warning message, by using this one line of code below:
np.warnings.filterwarnings('ignore')

Follow-up beginner question / remark
Maybe a clarification for other beginners like me (I come from R which seems to work a bit differently under the hood). The following harmless-looking and functional code kept producing the SettingWithCopy warning, and I couldn't figure out why. I had both read and understood the issued with "chained indexing", but my code doesn't contain any:
def plot(pdb, df, title, **kw):
df['target'] = (df['ogg'] + df['ugg']) / 2
# ...
But then, later, much too late, I looked at where the plot() function is called:
df = data[data['anz_emw'] > 0]
pixbuf = plot(pdb, df, title)
So "df" isn't a data frame, but an object that somehow remembers that it was created by indexing a data frame (so is that a view?) which would make the line in plot(),
df['target'] = ...
equivalent to
data[data['anz_emw'] > 0]['target'] = ...
which is a chained indexing.
Anyway,
def plot(pdb, df, title, **kw):
df.loc[:,'target'] = (df['ogg'] + df['ugg']) / 2
fixed it.

You could avoid the whole problem like this, I believe:
return (
pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
.ix[:,[0,3,2,1,4,5,8,9,30,31]]
.assign(
TClose=lambda df: df['TPrice'],
RT=lambda df: 100 * (df['TPrice']/quote_df['TPCLOSE'] - 1),
TVol=lambda df: df['TVol']/TVOL_SCALE,
TAmt=lambda df: df['TAmt']/TAMT_SCALE,
STK_ID=lambda df: df['STK'].str.slice(13,19),
STK_Name=lambda df: df['STK'].str.slice(21,30)#.decode('gb2312'),
TDate=lambda df: df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10]),
)
)
Using Assign. From the documentation: Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
See Tom Augspurger's article on method chaining in pandas: Modern Pandas (Part 2): Method Chaining

I was facing the same warning, while I executed this part of my code:
def scaler(self, numericals):
scaler = MinMaxScaler()
self.data.loc[:, numericals[0]] = scaler.fit_transform(self.data.loc[:, numericals[0]])
self.data.loc[:, numericals[1]] = scaler.fit_transform(self.data.loc[:, numericals[1]])
where scaler is a MinMaxScaler and numericals[0] contains names of three of my numerical columns.
The warning was removed as I changed the code to:
def scaler(self, numericals):
scaler = MinMaxScaler()
self.data.loc[:][numericals[0]] = scaler.fit_transform(self.data.loc[:][numericals[0]])
self.data.loc[:][numericals[1]] = scaler.fit_transform(self.data.loc[:][numericals[1]])
So, just change [:, ~] to [:][~].

If you have assigned the slice to a variable and want to set using the variable as in the following:
df2 = df[df['A'] > 2]
df2['B'] = value
And you do not want to use Jeff's solution, because your condition computing df2 is to long or for some other reason, then you can use the following:
df.loc[df2.index.tolist(), 'B'] = value
df2.index.tolist() returns the indices from all entries in df2, which will then be used to set column B in the original dataframe.

In my case, I would create a new column based on the index, but I got the same warning as you:
df_temp["Quarter"] = df_temp.index.quarter
I use insert() instead of direct assignment, and it works for me:
df_temp.insert(loc=0, column='Quarter', value=df_temp.index.quarter)

For me this issue occurred in a following simplified example. And I was also able to solve it (hopefully with a correct solution):
Old code with warning:
def update_old_dataframe(old_dataframe, new_dataframe):
for new_index, new_row in new_dataframe.iterrorws():
old_dataframe.loc[new_index] = update_row(old_dataframe.loc[new_index], new_row)
def update_row(old_row, new_row):
for field in [list_of_columns]:
# line with warning because of chain indexing old_dataframe[new_index][field]
old_row[field] = new_row[field]
return old_row
This printed the warning for the line old_row[field] = new_row[field]
Since the rows in update_row method are actually type Series, I replaced the line with:
old_row.at[field] = new_row.at[field]
I.e., a method for accessing/lookups for a Series. Even though both works just fine and the result is same, this way I don't have to disable the warnings (=keep them for other chain indexing issues somewhere else).

Just create a copy of your dataframe(s) using the .copy() method before the warning appears, to remove all of your warnings.
This happens, because we do not want to make changes to the original quote_df. In other words, we do not want to play with the reference of the object of the quote_df which we have created for quote_df.
quote_df = quote_df.copy()

Pandas Copy Warnings [duplicate]

Background
I just upgraded my Pandas from 0.11 to 0.13.0rc1. Now, the application is popping out many new warnings. One of them like this:
E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
I want to know what exactly it means? Do I need to change something?
How should I suspend the warning if I insist to use quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE?
The function that gives errors
def _decode_stock_quote(list_of_150_stk_str):
"""decode the webpage and return dataframe"""
from cStringIO import StringIO
str_of_all = "".join(list_of_150_stk_str)
quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
quote_df['TClose'] = quote_df['TPrice']
quote_df['RT'] = 100 * (quote_df['TPrice']/quote_df['TPCLOSE'] - 1)
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
quote_df['TAmt'] = quote_df['TAmt']/TAMT_SCALE
quote_df['STK_ID'] = quote_df['STK'].str.slice(13,19)
quote_df['STK_Name'] = quote_df['STK'].str.slice(21,30)#.decode('gb2312')
quote_df['TDate'] = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])
return quote_df
More error messages
E:\FinReporter\FM_EXT.py:449: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TVol'] = quote_df['TVol']/TVOL_SCALE
E:\FinReporter\FM_EXT.py:450: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TAmt'] = quote_df['TAmt']/TAMT_SCALE
E:\FinReporter\FM_EXT.py:453: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
quote_df['TDate'] = quote_df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10])

The SettingWithCopyWarning was created to flag potentially confusing "chained" assignments, such as the following, which does not always work as expected, particularly when the first selection returns a copy. [see GH5390 and GH5597 for background discussion.]
df[df['A'] > 2]['B'] = new_val # new_val not set in df
The warning offers a suggestion to rewrite as follows:
df.loc[df['A'] > 2, 'B'] = new_val
However, this doesn't fit your usage, which is equivalent to:
df = df[df['A'] > 2]
df['B'] = new_val
While it's clear that you don't care about writes making it back to the original frame (since you are overwriting the reference to it), unfortunately this pattern cannot be differentiated from the first chained assignment example. Hence the (false positive) warning. The potential for false positives is addressed in the docs on indexing, if you'd like to read further. You can safely disable this new warning with the following assignment.
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
Other Resources
pandas User Guide: Indexing and selecting data
Python Data Science Handbook: Data Indexing and Selection
Real Python: SettingWithCopyWarning in Pandas: Views vs Copies
Dataquest: SettingwithCopyWarning: How to Fix This Warning in Pandas
Towards Data Science: Explaining the SettingWithCopyWarning in pandas

How to deal with SettingWithCopyWarning in Pandas?
This post is meant for readers who,
Would like to understand what this warning means
Would like to understand different ways of suppressing this warning
Would like to understand how to improve their code and follow good practices to avoid this warning in the future.
Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df
A B C D E
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
What is the SettingWithCopyWarning?
To know how to deal with this warning, it is important to understand what it means and why it is raised in the first place.
When filtering DataFrames, it is possible slice/index a frame to return either a view, or a copy, depending on the internal layout and various implementation details. A "view" is, as the term suggests, a view into the original data, so modifying the view may modify the original object. On the other hand, a "copy" is a replication of data from the original, and modifying the copy has no effect on the original.
As mentioned by other answers, the SettingWithCopyWarning was created to flag "chained assignment" operations. Consider df in the setup above. Suppose you would like to select all values in column "B" where values in column "A" is > 5. Pandas allows you to do this in different ways, some more correct than others. For example,
df[df.A > 5]['B']
1 3
2 6
Name: B, dtype: int64
And,
df.loc[df.A > 5, 'B']
1 3
2 6
Name: B, dtype: int64
These return the same result, so if you are only reading these values, it makes no difference. So, what is the issue? The problem with chained assignment, is that it is generally difficult to predict whether a view or a copy is returned, so this largely becomes an issue when you are attempting to assign values back. To build on the earlier example, consider how this code is executed by the interpreter:
df.loc[df.A > 5, 'B'] = 4
# becomes
df.__setitem__((df.A > 5, 'B'), 4)
With a single __setitem__ call to df. OTOH, consider this code:
df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)
Now, depending on whether __getitem__ returned a view or a copy, the __setitem__ operation may not work.
In general, you should use loc for label-based assignment, and iloc for integer/positional based assignment, as the spec guarantees that they always operate on the original. Additionally, for setting a single cell, you should use at and iat.
More can be found in the documentation.
Note
All boolean indexing operations done with loc can also be done with iloc. The only difference is that iloc expects either
integers/positions for index or a numpy array of boolean values, and
integer/position indexes for the columns.
For example,
df.loc[df.A > 5, 'B'] = 4
Can be written nas
df.iloc[(df.A > 5).values, 1] = 4
And,
df.loc[1, 'A'] = 100
Can be written as
df.iloc[1, 0] = 100
And so on.
Just tell me how to suppress the warning!
Consider a simple operation on the "A" column of df. Selecting "A" and dividing by 2 will raise the warning, but the operation will work.
df2 = df[['A']]
df2['A'] /= 2
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
df2
A
0 2.5
1 4.5
2 3.5
There are a couple ways of directly silencing this warning:
(recommended) Use loc to slice subsets:
df2 = df.loc[:, ['A']]
df2['A'] /= 2 # Does not raise
Change pd.options.mode.chained_assignment
Can be set to None, "warn", or "raise". "warn" is the default. None will suppress the warning entirely, and "raise" will throw a SettingWithCopyError, preventing the operation from going through.
pd.options.mode.chained_assignment = None
df2['A'] /= 2
Make a deepcopy
df2 = df[['A']].copy(deep=True)
df2['A'] /= 2
#Peter Cotton in the comments, came up with a nice way of non-intrusively changing the mode (modified from this gist) using a context manager, to set the mode only as long as it is required, and the reset it back to the original state when finished.
class ChainedAssignent:
def __init__(self, chained=None):
acceptable = [None, 'warn', 'raise']
assert chained in acceptable, "chained must be in " + str(acceptable)
self.swcw = chained
def __enter__(self):
self.saved_swcw = pd.options.mode.chained_assignment
pd.options.mode.chained_assignment = self.swcw
return self
def __exit__(self, *args):
pd.options.mode.chained_assignment = self.saved_swcw
The usage is as follows:
# Some code here
with ChainedAssignent():
df2['A'] /= 2
# More code follows
Or, to raise the exception
with ChainedAssignent(chained='raise'):
df2['A'] /= 2
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The "XY Problem": What am I doing wrong?
A lot of the time, users attempt to look for ways of suppressing this exception without fully understanding why it was raised in the first place. This is a good example of an XY problem, where users attempt to solve a problem "Y" that is actually a symptom of a deeper rooted problem "X". Questions will be raised based on common problems that encounter this warning, and solutions will then be presented.
Question 1
I have a DataFrame
df
A B C D E
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
I want to assign values in col "A" > 5 to 1000. My expected output is
A B C D E
0 5 0 3 3 7
1 1000 3 5 2 4
2 1000 6 8 8 1
Wrong way to do this:
df.A[df.A > 5] = 1000 # works, because df.A returns a view
df[df.A > 5]['A'] = 1000 # does not work
df.loc[df.A > 5]['A'] = 1000 # does not work
Right way using loc:
df.loc[df.A > 5, 'A'] = 1000
Question 21
I am trying to set the value in cell (1, 'D') to 12345. My expected output is
A B C D E
0 5 0 3 3 7
1 9 3 5 12345 4
2 7 6 8 8 1
I have tried different ways of accessing this cell, such as
df['D'][1]. What is the best way to do this?
1. This question isn't specifically related to the warning, but
it is good to understand how to do this particular operation correctly
so as to avoid situations where the warning could potentially arise in
future.
You can use any of the following methods to do this.
df.loc[1, 'D'] = 12345
df.iloc[1, 3] = 12345
df.at[1, 'D'] = 12345
df.iat[1, 3] = 12345
Question 3
I am trying to subset values based on some condition. I have a
DataFrame
A B C D E
1 9 3 5 2 4
2 7 6 8 8 1
I would like to assign values in "D" to 123 such that "C" == 5. I
tried
df2.loc[df2.C == 5, 'D'] = 123
Which seems fine but I am still getting the
SettingWithCopyWarning! How do I fix this?
This is actually probably because of code higher up in your pipeline. Did you create df2 from something larger, like
df2 = df[df.A > 5]
? In this case, boolean indexing will return a view, so df2 will reference the original. What you'd need to do is assign df2 to a copy:
df2 = df[df.A > 5].copy()
# Or,
# df2 = df.loc[df.A > 5, :]
Question 4
I'm trying to drop column "C" in-place from
A B C D E
1 9 3 5 2 4
2 7 6 8 8 1
But using
df2.drop('C', axis=1, inplace=True)
Throws SettingWithCopyWarning. Why is this happening?
This is because df2 must have been created as a view from some other slicing operation, such as
df2 = df[df.A > 5]
The solution here is to either make a copy() of df, or use loc, as before.

In general the point of the SettingWithCopyWarning is to show users (and especially new users) that they may be operating on a copy and not the original as they think. There are false positives (IOW if you know what you are doing it could be ok). One possibility is simply to turn off the (by default warn) warning as #Garrett suggest.
Here is another option:
In [1]: df = DataFrame(np.random.randn(5, 2), columns=list('AB'))
In [2]: dfa = df.ix[:, [1, 0]]
In [3]: dfa.is_copy
Out[3]: True
In [4]: dfa['A'] /= 2
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
#!/usr/local/bin/python
You can set the is_copy flag to False, which will effectively turn off the check, for that object:
In [5]: dfa.is_copy = False
In [6]: dfa['A'] /= 2
If you explicitly copy then no further warning will happen:
In [7]: dfa = df.ix[:, [1, 0]].copy()
In [8]: dfa['A'] /= 2
The code the OP is showing above, while legitimate, and probably something I do as well, is technically a case for this warning, and not a false positive. Another way to not have the warning would be to do the selection operation via reindex, e.g.
quote_df = quote_df.reindex(columns=['STK', ...])
Or,
quote_df = quote_df.reindex(['STK', ...], axis=1) # v.0.21

Here I answer the question directly. How can we deal with it?
Make a .copy(deep=False) after you slice. See pandas.DataFrame.copy.
Wait, doesn't a slice return a copy? After all, this is what the warning message is attempting to say? Read the long answer:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3]})
This gives a warning:
df0 = df[df.x>2]
df0['foo'] = 'bar'
This does not:
df1 = df[df.x>2].copy(deep=False)
df1['foo'] = 'bar'
Both df0 and df1 are DataFrame objects, but something about them is different that enables pandas to print the warning. Let's find out what it is.
import inspect
slice= df[df.x>2]
slice_copy = df[df.x>2].copy(deep=False)
inspect.getmembers(slice)
inspect.getmembers(slice_copy)
Using your diff tool of choice, you will see that beyond a couple of addresses, the only material difference is this:
| | slice | slice_copy |
| _is_copy | weakref | None |
The method that decides whether to warn is DataFrame._check_setitem_copy which checks _is_copy. So here you go. Make a copy so that your DataFrame is not _is_copy.
The warning is suggesting to use .loc, but if you use .loc on a frame that _is_copy, you will still get the same warning. Misleading? Yes. Annoying? You bet. Helpful? Potentially, when chained assignment is used. But it cannot correctly detect chain assignment and prints the warning indiscriminately.

Pandas dataframe copy warning
When you go and do something like this:
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
pandas.ix in this case returns a new, stand alone dataframe.
Any values you decide to change in this dataframe, will not change the original dataframe.
This is what pandas tries to warn you about.
Why .ix is a bad idea
The .ix object tries to do more than one thing, and for anyone who has read anything about clean code, this is a strong smell.
Given this dataframe:
df = pd.DataFrame({"a": [1,2,3,4], "b": [1,1,2,2]})
Two behaviors:
dfcopy = df.ix[:,["a"]]
dfcopy.a.ix[0] = 2
Behavior one: dfcopy is now a stand alone dataframe. Changing it will not change df
df.ix[0, "a"] = 3
Behavior two: This changes the original dataframe.
Use .loc instead
The pandas developers recognized that the .ix object was quite smelly[speculatively] and thus created two new objects which helps in the accession and assignment of data. (The other being .iloc)
.loc is faster, because it does not try to create a copy of the data.
.loc is meant to modify your existing dataframe inplace, which is more memory efficient.
.loc is predictable, it has one behavior.
The solution
What you are doing in your code example is loading a big file with lots of columns, then modifying it to be smaller.
The pd.read_csv function can help you out with a lot of this and also make the loading of the file a lot faster.
So instead of doing this
quote_df = pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
quote_df.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
quote_df = quote_df.ix[:,[0,3,2,1,4,5,8,9,30,31]]
Do this
columns = ['STK', 'TPrice', 'TPCLOSE', 'TOpen', 'THigh', 'TLow', 'TVol', 'TAmt', 'TDate', 'TTime']
df = pd.read_csv(StringIO(str_of_all), sep=',', usecols=[0,3,2,1,4,5,8,9,30,31])
df.columns = columns
This will only read the columns you are interested in, and name them properly. No need for using the evil .ix object to do magical stuff.

This topic is really confusing with Pandas. Luckily, it has a relatively simple solution.
The problem is that it is not always clear whether data filtering operations (e.g. loc) return a copy or a view of the DataFrame. Further use of such filtered DataFrame could therefore be confusing.
The simple solution is (unless you need to work with very large sets of data):
Whenever you need to update any values, always make sure that you explicitly copy the DataFrame before the assignment.
df # Some DataFrame
df = df.loc[:, 0:2] # Some filtering (unsure whether a view or copy is returned)
df = df.copy() # Ensuring a copy is made
df[df["Name"] == "John"] = "Johny" # Assignment can be done now (no warning)

I had been getting this issue with .apply() when assigning a new dataframe from a pre-existing dataframe on which I've used the .query() method. For instance:
prop_df = df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)
Would return this error. The fix that seems to resolve the error in this case is by changing this to:
prop_df = df.copy(deep=True)
prop_df = prop_df.query('column == "value"')
prop_df['new_column'] = prop_df.apply(function, axis=1)
However, this is not efficient especially when using large dataframes, due to having to make a new copy.
If you're using the .apply() method in generating a new column and its values, a fix that resolves the error and is more efficient is by adding .reset_index(drop=True):
prop_df = df.query('column == "value"').reset_index(drop=True)
prop_df['new_column'] = prop_df.apply(function, axis=1)

Just simply:
import pandas as pd
# ...
pd.set_option('mode.chained_assignment', None)

To remove any doubt, my solution was to make a deep copy of the slice instead of a regular copy.
This may not be applicable depending on your context (Memory constraints / size of the slice, potential for performance degradation - especially if the copy occurs in a loop like it did for me, etc...)
To be clear, here is the warning I received:
/opt/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:54:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Illustration
I had doubts that the warning was thrown because of a column I was dropping on a copy of the slice. While not technically trying to set a value in the copy of the slice, that was still a modification of the copy of the slice.
Below are the (simplified) steps I have taken to confirm the suspicion, I hope it will help those of us who are trying to understand the warning.
Example 1: dropping a column on the original affects the copy
We knew that already but this is a healthy reminder. This is NOT what the warning is about.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> df2 = df1
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df1 affects df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
B
0 121
1 122
2 123
It is possible to avoid changes made on df1 to affect df2. Note: you can avoid importing copy.deepcopy by doing df.copy() instead.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df1 does not affect df2
>> df1.drop('A', axis=1, inplace=True)
>> df2
A B
0 111 121
1 112 122
2 113 123
Example 2: dropping a column on the copy may affect the original
This actually illustrates the warning.
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> df2 = df1
>> df2
A B
0 111 121
1 112 122
2 113 123
# Dropping a column on df2 can affect df1
# No slice involved here, but I believe the principle remains the same?
# Let me know if not
>> df2.drop('A', axis=1, inplace=True)
>> df1
B
0 121
1 122
2 123
It is possible to avoid changes made on df2 to affect df1
>> data1 = {'A': [111, 112, 113], 'B':[121, 122, 123]}
>> df1 = pd.DataFrame(data1)
>> df1
A B
0 111 121
1 112 122
2 113 123
>> import copy
>> df2 = copy.deepcopy(df1)
>> df2
A B
0 111 121
1 112 122
2 113 123
>> df2.drop('A', axis=1, inplace=True)
>> df1
A B
0 111 121
1 112 122
2 113 123

This should work:
quote_df.loc[:,'TVol'] = quote_df['TVol']/TVOL_SCALE

Some may want to simply suppress the warning:
class SupressSettingWithCopyWarning:
def __enter__(self):
pd.options.mode.chained_assignment = None
def __exit__(self, *args):
pd.options.mode.chained_assignment = 'warn'
with SupressSettingWithCopyWarning():
#code that produces warning

As this question is already fully explained and discussed in existing answers, I will just provide a neat pandas approach to the context manager using pandas.option_context (links to documentation and example) - there is absolutely isn't any need to create a custom class with all the dunder methods and other bells and whistles.
First the context manager code itself:
from contextlib import contextmanager
#contextmanager
def SuppressPandasWarning():
with pd.option_context("mode.chained_assignment", None):
yield
Then an example:
import pandas as pd
from string import ascii_letters
a = pd.DataFrame({"A": list(ascii_letters[0:4]), "B": range(0,4)})
mask = a["A"].isin(["c", "d"])
# Even shallow copy below is enough to not raise the warning, but why is a mystery to me.
b = a.loc[mask] # .copy(deep=False)
# Raises the `SettingWithCopyWarning`
b["B"] = b["B"] * 2
# Does not!
with SuppressPandasWarning():
b["B"] = b["B"] * 2
It is worth noticing is that both approaches do not modify a, which is a bit surprising to me, and even a shallow df copy with .copy(deep=False) would prevent this warning to be raised (as far as I understand, shallow copy should at least modify a as well, but it doesn't. pandas magic.).

This might apply to NumPy only, which means you might need to import it, but the data I used for my examples NumPy was not essential with the calculations, but you can simply stop this settingwithcopy warning message, by using this one line of code below:
np.warnings.filterwarnings('ignore')

Follow-up beginner question / remark
Maybe a clarification for other beginners like me (I come from R which seems to work a bit differently under the hood). The following harmless-looking and functional code kept producing the SettingWithCopy warning, and I couldn't figure out why. I had both read and understood the issued with "chained indexing", but my code doesn't contain any:
def plot(pdb, df, title, **kw):
df['target'] = (df['ogg'] + df['ugg']) / 2
# ...
But then, later, much too late, I looked at where the plot() function is called:
df = data[data['anz_emw'] > 0]
pixbuf = plot(pdb, df, title)
So "df" isn't a data frame, but an object that somehow remembers that it was created by indexing a data frame (so is that a view?) which would make the line in plot(),
df['target'] = ...
equivalent to
data[data['anz_emw'] > 0]['target'] = ...
which is a chained indexing.
Anyway,
def plot(pdb, df, title, **kw):
df.loc[:,'target'] = (df['ogg'] + df['ugg']) / 2
fixed it.

You could avoid the whole problem like this, I believe:
return (
pd.read_csv(StringIO(str_of_all), sep=',', names=list('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg')) #dtype={'A': object, 'B': object, 'C': np.float64}
.rename(columns={'A':'STK', 'B':'TOpen', 'C':'TPCLOSE', 'D':'TPrice', 'E':'THigh', 'F':'TLow', 'I':'TVol', 'J':'TAmt', 'e':'TDate', 'f':'TTime'}, inplace=True)
.ix[:,[0,3,2,1,4,5,8,9,30,31]]
.assign(
TClose=lambda df: df['TPrice'],
RT=lambda df: 100 * (df['TPrice']/quote_df['TPCLOSE'] - 1),
TVol=lambda df: df['TVol']/TVOL_SCALE,
TAmt=lambda df: df['TAmt']/TAMT_SCALE,
STK_ID=lambda df: df['STK'].str.slice(13,19),
STK_Name=lambda df: df['STK'].str.slice(21,30)#.decode('gb2312'),
TDate=lambda df: df.TDate.map(lambda x: x[0:4]+x[5:7]+x[8:10]),
)
)
Using Assign. From the documentation: Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones.
See Tom Augspurger's article on method chaining in pandas: Modern Pandas (Part 2): Method Chaining

I was facing the same warning, while I executed this part of my code:
def scaler(self, numericals):
scaler = MinMaxScaler()
self.data.loc[:, numericals[0]] = scaler.fit_transform(self.data.loc[:, numericals[0]])
self.data.loc[:, numericals[1]] = scaler.fit_transform(self.data.loc[:, numericals[1]])
where scaler is a MinMaxScaler and numericals[0] contains names of three of my numerical columns.
The warning was removed as I changed the code to:
def scaler(self, numericals):
scaler = MinMaxScaler()
self.data.loc[:][numericals[0]] = scaler.fit_transform(self.data.loc[:][numericals[0]])
self.data.loc[:][numericals[1]] = scaler.fit_transform(self.data.loc[:][numericals[1]])
So, just change [:, ~] to [:][~].

If you have assigned the slice to a variable and want to set using the variable as in the following:
df2 = df[df['A'] > 2]
df2['B'] = value
And you do not want to use Jeff's solution, because your condition computing df2 is to long or for some other reason, then you can use the following:
df.loc[df2.index.tolist(), 'B'] = value
df2.index.tolist() returns the indices from all entries in df2, which will then be used to set column B in the original dataframe.

In my case, I would create a new column based on the index, but I got the same warning as you:
df_temp["Quarter"] = df_temp.index.quarter
I use insert() instead of direct assignment, and it works for me:
df_temp.insert(loc=0, column='Quarter', value=df_temp.index.quarter)

For me this issue occurred in a following simplified example. And I was also able to solve it (hopefully with a correct solution):
Old code with warning:
def update_old_dataframe(old_dataframe, new_dataframe):
for new_index, new_row in new_dataframe.iterrorws():
old_dataframe.loc[new_index] = update_row(old_dataframe.loc[new_index], new_row)
def update_row(old_row, new_row):
for field in [list_of_columns]:
# line with warning because of chain indexing old_dataframe[new_index][field]
old_row[field] = new_row[field]
return old_row
This printed the warning for the line old_row[field] = new_row[field]
Since the rows in update_row method are actually type Series, I replaced the line with:
old_row.at[field] = new_row.at[field]
I.e., a method for accessing/lookups for a Series. Even though both works just fine and the result is same, this way I don't have to disable the warnings (=keep them for other chain indexing issues somewhere else).

Just create a copy of your dataframe(s) using the .copy() method before the warning appears, to remove all of your warnings.
This happens, because we do not want to make changes to the original quote_df. In other words, we do not want to play with the reference of the object of the quote_df which we have created for quote_df.
quote_df = quote_df.copy()

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

PySpark, top for DataFrame - apache-spark

Related

Spark UDF error AttributeError: 'NoneType' object has no attribute '_jvm'

Pyspark: aggregate mode (most frequent) value in a rolling window

Syntax for np.where replace column values [duplicate]

How to properly make a pandas mask [duplicate]

Pandas Copy Warnings [duplicate]

Categories

Resources