Pandas copy slice warning appears to be inconsistent? - python-3.x

I know there are a million posts about the Pandas DataFrame copy/slice warning, and I have researched this... but I still don't understand why the warning is NOT raised at LINE 10 below but IS raised at LINE 15 below. Using Python 3.8.3 and pandas 1.0.5.
import pandas as pd
#### Example DataFrame
myid = [1, 1, 1, 2, 2]
myorder = [3, 2, 1, 2, 1]
y = [3642, 3640, 3632, 3628, 3608]
x = [11811, 11812, 11807, 11795, 11795]
df = pd.DataFrame(list(zip(myid, myorder, x, y)),
                  columns=['myid', 'myorder', 'x', 'y'])
df.sort_values(by=['myid', 'myorder'], inplace=True) ## LINE 10
df.reset_index(drop=True, inplace=True)
idval = 2
tempdf = df[df.myid == idval]
tempdf.sort_values(by=['myid', 'myorder'], inplace=True) ## LINE 15
tempdf.reset_index(drop=True, inplace=True)

This line:
tempdf = df[df.myid == idval]
creates a view called tempdf on df: the underlying data has not been copied. Think of tempdf as a pre-recorded filter applied to df; further changes will only be applied to the rows of df that meet the filter.
This means that if you update tempdf, you will be updating df - hence the warning.
To avoid the warning you need to do the following, which forces df and tempdf to use different underlying data structures:
tempdf = df[df.myid == idval].copy()
Now changes to tempdf will have no impact on df, so your warning goes away.
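For example, a minimal sketch of the fixed flow (same variable names as the question):
tempdf = df[df.myid == idval].copy()                      # tempdf now owns its own data
tempdf.sort_values(by=['myid', 'myorder'], inplace=True)  # no SettingWithCopyWarning
tempdf.reset_index(drop=True, inplace=True)
print(df)                                                 # df is unchanged by the edits above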

Related

Annotate each FacetGrid subplot using custom df (or list) using a func

Consider the following data and FacetGrid:
import pandas as pd
import seaborn as sns

d = {'SITE': ['A', 'B', 'C', 'C', 'A'],
     'VF': [0.00, 0.78, 0.99, 1.00, 0.50],
     'TYPE': ['typeA', 'typeA', 'typeB', 'typeC', 'typeD']}
new_df = pd.DataFrame(data=d)
with sns.axes_style("white"):
    g = sns.FacetGrid(data=new_df, col='SITE', col_wrap=3, height=7, aspect=0.25,
                      hue='TYPE', palette=['#1E88E5', '#FFC107', '#D81B60'])
    g.map(sns.scatterplot, 'VF', 'TYPE', s=100)
Using another dataframe:
d = {'SITE':['A', 'B', 'C'], 'N':[10, 5, 7]}
ann_df = pd.DataFrame(data=d)
where the SITE values match the original new_df['SITE']; ann_df does not have the same dimensions as new_df, but it has one row per column (facet) in the FacetGrid.
How do you annotate each subplot with a custom func, using not the scatterplot data in new_df but ann_df (or a custom list), matching it against the original new_df['SITE'] and adding the corresponding ann_df['N'] to each subplot?
So, something along these lines or better:
def annotate(data, **kws):
    n = data  # should be the int for each matching SITE
    ax = plt.gca()
    ax.text(.1, .2, f"N = {n}", transform=ax.transAxes)

g.map_dataframe(annotate(ann_df))
It is recommended from seaborn v0.11.0 onward to use figure-level functions like seaborn.relplot instead of seaborn.FacetGrid directly.
The values used for col= are plotted alphabetically by default; otherwise, specify an order with col_order= and make sure ann_df['SITE'] is sorted in the same order (see the sketch after the code below).
Flatten the seaborn.axisgrid.FacetGrid returned by sns.relplot, iterate through the matplotlib.axes, and add .text to each plot, using i from enumerate with .iloc to index the correct value for 'N'.
Similar to this answer, but getting the data from a secondary DataFrame instead of a dict.
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
import seaborn as sns
import pandas as pd

# DataFrame 1
d1 = {'SITE': ['A', 'B', 'C', 'C', 'A'],
      'VF': [0.00, 0.78, 0.99, 1.00, 0.50],
      'TYPE': ['typeA', 'typeA', 'typeB', 'typeC', 'typeD']}
df = pd.DataFrame(data=d1)

# DataFrame 2
d2 = {'SITE': ['A', 'B', 'C'], 'N': [10, 5, 7]}
ann_df = pd.DataFrame(data=d2)

# plot
g = sns.relplot(kind='scatter', data=df, x='VF', y='TYPE', col='SITE',
                col_wrap=3, height=7, aspect=0.5, hue='TYPE', s=100)

# flatten axes into a 1-d array
axes = g.axes.flatten()

# iterate through the axes
for i, ax in enumerate(axes):
    ax.text(0, 3, f"N = {ann_df.iloc[i, 1]}")
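If a non-alphabetical order is needed, here is a minimal sketch of the col_order alignment mentioned above; the order ['C', 'A', 'B'] is just an arbitrary example:
order = ['C', 'A', 'B']
ann_df_ordered = ann_df.set_index('SITE').loc[order].reset_index()   # align ann_df with the facet order
g = sns.relplot(kind='scatter', data=df, x='VF', y='TYPE', col='SITE', col_order=order,
                col_wrap=3, height=7, aspect=0.5, hue='TYPE', s=100)
for i, ax in enumerate(g.axes.flatten()):
    ax.text(0, 3, f"N = {ann_df_ordered.iloc[i, 1]}")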

Unable to read data from kdeplot

I have a pandas dataframe with two columns, A and B, named df in the following bits of code.
I try to plot a kde for each value of B like so:
import matplotlib.pyplot as plt
import seaborn as sbn, numpy as np, pandas as pd
fig = plt.figure(figsize=(15, 7.5))
sbn.kdeplot(data=df, x="A", hue="B", fill=True)
fig.savefig("test.png")
I read the following propositions, but only those where I compute the kde from scratch using statsmodels or some other module get me anywhere:
Seaborn/Matplotlib: how to access line values in FacetGrid?
Get data points from Seaborn distplot
For curiosity's sake, I would like to know why I am unable to get something from the following code:
kde = sbn.kdeplot(data=df, x="A", hue="B", fill=True)
line = kde.lines[0]
x, y = line.get_data()
print(x, y)
The error I get is IndexError: list index out of range. kde.lines has a length of 0.
Accessing the lines through fig.axes[0].lines[0] also raises an IndexError.
All in all, I think I tried everything proposed in the previous threads. I also tried switching to displot instead of kdeplot (displot, not the deprecated distplot), but it is the same story, only that I have to access the axes differently. Every time I get to .get_lines(), ax.lines, etc., what is returned is an empty list, so I can't get any values out of it.
EDIT : Reproducible example
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sbn
# 1. Generate random data
df = pd.DataFrame(columns=["A", "B"])
for i in [1, 2, 3, 5, 7, 8, 10, 12, 15, 17, 20, 40, 50]:
    for _ in range(10):
        df = df.append({"A": np.random.random() * i, "B": i}, ignore_index=True)
# 2. Plot data
fig = plt.figure(figsize=(15, 7.5))
sbn.kdeplot(data=df, x="A", hue="B", fill=True)
# 3. Read data (error)
ax = fig.axes[0]
x, y = ax.lines[0].get_data()
print(x, y)
This happens because using fill=True changes the object that matplotlib draws.
When no fill is used, lines are plotted:
fig = plt.figure(figsize=(15, 7.5))
ax = sbn.kdeplot(data=df, x="A", hue="B")
print(ax.lines)
# [<matplotlib.lines.Line2D object at 0x000001F365EF7848>, etc.]
When you use fill=True, they become PolyCollection objects instead:
fig = plt.figure(figsize=(15, 7.5))
ax = sbn.kdeplot(data=df, x="A", hue="B", fill=True)
print(ax.collections)
# [<matplotlib.collections.PolyCollection object at 0x0000016EE13F39C8>, etc.]
You could draw the kdeplot a second time on the same axes, but with fill=False, so that you have access to the line objects.
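For example, a minimal sketch reusing df from the reproducible example above; the extra unfilled pass simply overlays lines on the same axes, so the data can be read back:
fig = plt.figure(figsize=(15, 7.5))
ax = sbn.kdeplot(data=df, x="A", hue="B", fill=True)   # filled plot, draws PolyCollections
sbn.kdeplot(data=df, x="A", hue="B", ax=ax)            # second pass without fill, draws Line2D objects
x, y = ax.lines[0].get_data()                          # now readable
print(x, y)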

Why is a copy of a pandas object altering one column on the original object? (Slice copy)

As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones (I'm not sure when).
However, in this case I make a copy by slicing and, when editing two columns of the copy, one column of the original is altered, but the other is not.
How is this possible? Why one column, and not both or neither of them?
Here is the code:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-neural-networks/student-admissions/student_data.csv'
data = pd.read_csv(url)
# Copy data
processed_data = data[:]
print(data[:10])
# Edit copy
processed_data['gre'] = processed_data['gre']/800.0
processed_data['gpa'] = processed_data['gpa']/4.0
# gpa column has changed
print(data[:10])
On the other hand, if I change processed_data = data[:] to processed_data = data.copy() it works fine.
As I understand, a copy by slicing copies the upper levels of a structure, but not the lower ones.
This is valid for Python lists. Slicing creates shallow copies.
In [44]: lst = [[1, 2], 3, 4]
In [45]: lst2 = lst[:]
In [46]: lst2[1] = 100
In [47]: lst # unchanged
Out[47]: [[1, 2], 3, 4]
In [48]: lst2[0].append(3)
In [49]: lst # changed
Out[49]: [[1, 2, 3], 3, 4]
However, this is not the case for numpy/pandas. numpy, for the most part, returns a view when you slice an array.
In [50]: arr = np.array([1, 2, 3])
In [51]: arr2 = arr[:]
In [52]: arr2[0] = 100
In [53]: arr
Out[53]: array([100, 2, 3])
If you have a DataFrame with a single dtype, the behaviour you see is the same:
In [62]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
In [63]: df
Out[63]:
   0  1  2
0  1  2  3
1  4  5  6
In [64]: df2 = df[:]
In [65]: df2.iloc[0, 0] = 100
In [66]: df
Out[66]:
     0  1  2
0  100  2  3
1    4  5  6
But when you have mixed dtypes, the behavior is not predictable, which is the main source of the infamous SettingWithCopyWarning:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it's very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That's what SettingWithCopy is warning you about!
In your case, my guess is that this is the result of how different dtypes are handled in pandas. Each dtype has its own block, and in the case of the gpa column the block is the column itself (gpa is the only float column). This is not the case for gre -- there are other integer columns. When I add a string column to data and modify it in processed_data, I see the same behavior. When I increase the number of float columns in data to 2, changing gpa in processed_data no longer affects the original data.
In a nutshell, the behavior is the result of an implementation detail which you shouldn't rely on. If you want to copy DataFrames, explicitly use .copy(); and if you want to modify parts of a DataFrame, don't assign those parts to other variables - modify them directly with .loc or .iloc.
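For example, a minimal sketch of both recommended patterns, using the column names from the question:
# 1. explicit copy: edits to processed_data never reach data
processed_data = data.copy()
processed_data['gre'] = processed_data['gre'] / 800.0
processed_data['gpa'] = processed_data['gpa'] / 4.0

# 2. or modify the original directly with .loc instead of going through another variable
data.loc[:, 'gre'] = data['gre'] / 800.0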

dask array map_blocks, with differently shaped dask array as argument

I'm trying to use dask.array.map_blocks to process a dask array, using a second dask array with a different shape as an argument. The use case is first running some peak finding on a 2-D stack of images (4 dimensions), which is returned as a 2-D dask array of dtype object. Thus, the first two dimensions of the two dask arrays are the same. The peaks are then used to extract intensities from the 4-dimensional dataset. In the code below, I've omitted the peak finding part. Dask version 1.0.0.
import numpy as np
import dask.array as da

def test_processing(data_chunk, position_chunk):
    output_array = np.empty(data_chunk.shape[:-2], dtype='object')
    for index in np.ndindex(data_chunk.shape[:-2]):
        islice = np.s_[index]
        intensity_list = []
        data = data_chunk[islice]
        positions = position_chunk[islice]
        for x, y in positions:
            intensity_list.append(data[x, y])
        output_array[islice] = np.array(intensity_list)
    return output_array

data = da.random.random(size=(4, 4, 10, 10), chunks=(2, 2, 10, 10))
positions = np.empty(data.shape[:-2], dtype='object')
for index in np.ndindex(positions.shape):
    positions[index] = np.arange(10).reshape(5, 2)

data_output = da.map_blocks(test_processing, data, positions, dtype=np.object,
                            chunks=(2, 2), drop_axis=(2, 3))
data_output.compute()
This gives the error ValueError: Can't drop an axis with more than 1 block. Please use atop instead. I'm guessing this is due to positions having 3 dimensions, while data has 4 dimensions.
The same function, but without the positions dask array works fine.
import numpy as np
import dask.array as da

def test_processing(data_chunk):
    output_array = np.empty(data_chunk.shape[:-2], dtype='object')
    for index in np.ndindex(data_chunk.shape[:-2]):
        islice = np.s_[index]
        intensity_list = []
        data = data_chunk[islice]
        positions = [[5, 2], [1, 3]]
        for x, y in positions:
            intensity_list.append(data[x, y])
        output_array[islice] = np.array(intensity_list)
    return output_array

data = da.random.random(size=(4, 4, 10, 10), chunks=(2, 2, 10, 10))
data_output = da.map_blocks(test_processing, data, dtype=np.object,
                            chunks=(2, 2), drop_axis=(2, 3))
data_computed = data_output.compute()
This has been fixed in more recent versions of dask: running the same code on version 2.3.0 of dask works fine.

How to encode multiple categorical columns for test data efficiently?

I have multiple categorical columns (nearly 50). I am using custom-made frequency encoding on the training data, and at the end I save the mapping as a nested dictionary. For the test data I use the map function to encode, and unseen labels are replaced with 0. But I need a more efficient way.
I have already tried pandas' replace method, but it doesn't take care of unseen labels and leaves them as they are. Further, I am very concerned about timing: I want, say, 80 columns and 1 row to be encoded within 60 ms. I just need the most efficient way to do it. I have taken my example from here.
import pandas
from sklearn import preprocessing

df = pandas.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
                       'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                       'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                                    'New_York']})
My dict looks something like this:
enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
       'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
       'location': {'New_York': 0, 'San_Diego': 1}}

# input_df is the test data frame to encode
for col in enc:
    if col in input_df.columns:
        input_df[col] = input_df[col].map(enc[col]).fillna(0)
Further, I want multiple columns to be encoded at once; I don't want a loop over every column. I guess we can't do that with map. Hence replace would be a good choice, but as said, it doesn't take care of unseen labels.
EDIT:
This is the code I am using for now. Please note there is only 1 row in the test data frame (not sure whether I should handle it as a numpy array to reduce time), and I need to decrease this time to under 60 ms. Further, I only have a dictionary for mapping (I can't use one-hot because of the use case). Current time = 331.74 ms. Any idea how to do it more efficiently? I am not sure multiprocessing will work. With the replace method I have run into several issues: 1. it does not handle unseen labels and leaves them as they are (an issue for strings); 2. it has problems when keys and values overlap.
from string import ascii_lowercase
import itertools
import pandas as pd
import numpy as np
import time

def iter_all_strings():
    for size in itertools.count(1):
        for s in itertools.product(ascii_lowercase, repeat=size):
            yield "".join(s)

l = []
for s in iter_all_strings():
    l.append(s)
    if s == 'gr':
        break
columns = l

df = pd.DataFrame(columns=columns)
for col in df.columns:
    df[col] = np.random.randint(1, 4000, 3000)

transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
print(f"The length of the dictionary is {len(transform_dict)}")

# Creating another test data frame
df2 = pd.DataFrame(columns=columns)
for col in df2.columns:
    df2[col] = np.random.randint(1, 4000, 1)
print(f"The shape of the 2nd data frame is {df2.shape}")

t1 = time.time()
for col in df2.columns:
    df2[col] = df2[col].map(transform_dict[col]).fillna(0)
print(f"Time taken is {time.time() - t1}")
# print(df)
Firstly, when you want to encode categorical variables that are not ordinal (meaning there is no inherent ordering between the values of the variable/column, e.g. cat, dog), you should use one-hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
                   'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                   'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                                'New_York']})

enc = [['cat', 'dog', 'monkey'],
       ['Brick', 'Champ', 'Ron', 'Veronica'],
       ['New_York', 'San_Diego']]

ohe = OneHotEncoder(categories=enc, handle_unknown='ignore', sparse=False)
Here, I have modified your enc in a way that can be fed into the OneHotEncoder.
Now comes the point: how are we going to handle the unseen labels?
When handle_unknown is set to 'ignore', the unseen values get zeros in all the dummy variables, which in a way helps the model understand that it is an unknown value.
encoded = ohe.fit_transform(df)
colnames = ['{}_{}'.format(col, val) for col, unique_values in zip(df.columns, ohe.categories_)
            for val in unique_values]
pd.DataFrame(encoded, columns=colnames)
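As a quick check of the unseen-label behaviour, a small sketch (the value 'hamster' is just a made-up unseen label):
test = pd.DataFrame({'pets': ['hamster'], 'owner': ['Ron'], 'location': ['New_York']})
print(ohe.transform(test))   # the three 'pets' dummy columns come back as all zeros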
Update:
If you are fine with ordinal encoding, the following change could help.
df2.apply(lambda row: [transform_dict[col].get(val, 0)
                       for col, val in row.items()],
          axis=1,
          result_type='expand')
#1000 loops, best of 3: 1.17 ms per loop
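Note that result_type='expand' returns integer column labels; if the original names are needed, they can be restored afterwards, e.g.:
encoded = df2.apply(lambda row: [transform_dict[col].get(val, 0)
                                 for col, val in row.items()],
                    axis=1, result_type='expand')
encoded.columns = df2.columns   # restore the original column names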
