With pandas, why is `DataFrame(foo) is foo` False?

Starting with the example...
In [1]: import pandas as pd
In [2]: from sklearn.datasets import load_iris
In [3]: iris = load_iris()
In [4]: X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
In [5]: output_df = pd.DataFrame(X)
In [6]: X is output_df
Out[6]: False
In [7]: list(X.columns)
Out[7]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
In [8]: output_df['y'] = iris.target
In [9]: list(X.columns)
Out[9]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)',
'y']
[6] says that X is output_df is False, meaning they are not the same object. If they are not the same object, then adding a column to one of them should not affect the other one.
However [9] tells us that adding a column to output_df definitely did add the same column to X, which implies they actually are the same object.
Why is there a disconnect here?
(pd.__version__ == 0.24.1 and python --version = Python 3.7.1, in case it matters)

There's some decoupling between a DataFrame and its underlying data, which is stored in its BlockManager. In your example the underlying BlockManager data is the same, so changing it on one DataFrame will impact the other:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.1'
In [2]: df = pd.DataFrame({'A': list('abc'), 'B': [10, 20, 30]})
In [3]: df2 = pd.DataFrame(df)
In [4]: df is df2
Out[4]: False
In [5]: df._data is df2._data
Out[5]: True
In [6]: df._data
Out[6]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
IntBlock: slice(1, 2, 1), 1 x 3, dtype: int64
ObjectBlock: slice(0, 1, 1), 1 x 3, dtype: object
Essentially, a DataFrame is a wrapper around the underlying data, so these really are different objects; certain components of them just happen to be shared. As a basic example, you can add dummy attributes to one without impacting the other:
In [7]: df.foo = 'bar'
In [8]: df.foo
Out[8]: 'bar'
In [9]: df2.foo
---------------------------------------------------------------------------
AttributeError: 'DataFrame' object has no attribute 'foo'
To get around the issue of shared underlying data you'll need to explicitly tell the DataFrame constructor to copy the input data via the copy parameter:
In [10]: df2 = pd.DataFrame(df, copy=True)
In [11]: df._data is df2._data
Out[11]: False
In [12]: df['C'] = [1.1, 2.2, 3.3]
In [13]: df
Out[13]:
A B C
0 a 10 1.1
1 b 20 2.2
2 c 30 3.3
In [14]: df2
Out[14]:
A B
0 a 10
1 b 20
2 c 30
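The independence of an explicit copy can be checked without touching private attributes such as _data. The following is a minimal sketch, not part of the original answer; the names independent and also_independent are made up for illustration. It relies only on the documented copy=True constructor argument and on DataFrame.copy(). Whether a plain pd.DataFrame(df) shares data with df depends on the pandas version, so the sketch makes no claim about that case.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abc"), "B": [10, 20, 30]})

# Two ways to ask for an independent object (hypothetical variable names).
independent = pd.DataFrame(df, copy=True)
also_independent = df.copy()

# Adding columns to the copies must not touch the original.
independent["C"] = [1.1, 2.2, 3.3]
also_independent["D"] = 0

print(list(df.columns))           # columns of the original
print(list(independent.columns))  # columns of the first copy
```

If the goal is simply "a second DataFrame I can modify freely", df.copy() states that intent most directly.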

Related

Whether slicing of DataFrame in python return copy or reference to the original DataFrame

import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
print(randn(5, 4))
df = pd.DataFrame( randn(5, 4), ['A', 'B', 'C', 'D', 'F'], ['W', 'X', 'Y', 'Z'] )
tmp_df = df['X']
print(type(tmp_df)) # Here type is Series (as expected)
tmp_df.loc[:] = 12.3
print(tmp_df)
print(df)
This code changes the content of (original) DataFrame df.
np.random.seed(101)
print(randn(5, 4))
df = pd.DataFrame( randn(5, 4), ['A', 'B', 'C', 'D', 'F'], ['W', 'X', 'Y', 'Z'] )
tmp_df = df.loc[['A', 'B'], ['W', 'X']]
print(type(tmp_df)) # Type is DataFrame
tmp_df.loc[:] = 12.3 # whereas, here when I change the content of the tmp_df it doesn't reflect on original array.
print(tmp_df)
print(df)
So does that mean that when a Series is sliced out of a DataFrame, the sliced object refers to the original data,
whereas when a DataFrame is sliced out, it does not point to the original DataFrame?
Please confirm whether this conclusion is correct. Help would be appreciated.
To put it in a simple manner: Indexing with lists in loc always returns a copy.
Let's work with a DataFrame df:
df=pd.DataFrame({'A':[i for i in range(100)]})
df.head(3)
Output:
0 0
1 1
2 2
Now take a slice with list indexers and modify it:
h=df.loc[[0,1,2],['A']]
h.loc[:] = 12.3
h
Output of h:
0 12.3
1 12.3
2 12.3
The change is not reflected in df, unlike what happened in your first case:
df.head(3)
Output:
0 0
1 1
2 2
But when you do tmp_df = df['X'], the Series tmp_df refers to the contents of column "X" in df, so modifying tmp_df also changes df.
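Because view-versus-copy behaviour differs between pandas versions, it is usually safer not to rely on it. The sketch below is an illustration only, not from the original answer (the variable names original and unchanged are invented). It shows the two explicit options: call .copy() when an independent object is wanted, and assign through df.loc when the original should change.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(101)
df = pd.DataFrame(rng.standard_normal((5, 4)),
                  index=["A", "B", "C", "D", "F"],
                  columns=["W", "X", "Y", "Z"])
original = df.copy()  # snapshot used only for comparison

# Option 1: an explicit copy is independent of df.
tmp_df = df.loc[["A", "B"], ["W", "X"]].copy()
tmp_df.loc[:] = 12.3
unchanged = df.equals(original)  # df was not modified

# Option 2: assigning through df.loc writes to df itself.
df.loc[["A", "B"], ["W", "X"]] = 12.3
print(df)
```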

Calculating weighted average using grouped .agg in pandas

I would like to calculate, by group, the mean of one column and the weighted mean of another column in a dataset using the .agg() function within pandas. I am aware of a few solutions, but they aren't very concise.
One solution has been posted here (pandas and groupby: how to calculate weighted averages within an agg), but it still doesn't seem very flexible because the weights column is hard-coded in the lambda function definition. I'm looking for a syntax closer to this:
(
    df
    .groupby(['group'])
    .agg(avg_x=('x', 'mean'),
         wt_avg_y=('y', 'weighted_mean', weights='weight')
    )
)
Here is a fully worked example with code that seems needlessly complicated:
import pandas as pd
import numpy as np
# sample dataset
df = pd.DataFrame({
'group': ['a', 'a', 'b', 'b'],
'x': [1, 2, 3, 4],
'y': [5, 6, 7, 8],
'weights': [0.75, 0.25, 0.75, 0.25]
})
df
#>>> group x y weights
#>>> 0 a 1 5 0.75
#>>> 1 a 2 6 0.25
#>>> 2 b 3 7 0.75
#>>> 3 b 4 8 0.25
# aggregation logic
summary = pd.concat(
[
df.groupby(['group']).x.mean(),
df.groupby(['group']).apply(lambda x: np.average(x['y'], weights=x['weights']))
], axis=1
)
# manipulation to format the output of the aggregation
summary = summary.reset_index().rename(columns={'x': 'avg_x', 0: 'wt_avg_y'})
# final output
summary
#>>> group avg_x wt_avg_y
#>>> 0 a 1.50 5.25
#>>> 1 b 3.50 7.25
Using the .apply() method on the entire DataFrame was the simplest solution I could arrive at that does not hard-code the column name inside the function definition.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'group': ['a', 'a', 'b', 'b'],
'x': [1, 2, 3, 4],
'y': [5, 6, 7, 8],
'weights': [0.75, 0.25, 0.75, 0.25]
})
summary = (
df
.groupby(['group'])
.apply(
lambda x: pd.Series([
np.mean(x['x']),
np.average(x['y'], weights=x['weights'])
], index=['avg_x', 'wt_avg_y'])
)
.reset_index()
)
# final output
summary
#>>> group avg_x wt_avg_y
#>>> 0 a 1.50 5.25
#>>> 1 b 3.50 7.25
How about this:
grouped = df.groupby('group')
def wavg(group):
    group['mean_x'] = group['x'].mean()
    group['wavg_y'] = np.average(group['y'], weights=group.loc[:, "weights"])
    return group
grouped.apply(wavg)
Try:
df["weights"]=df["weights"].div(df.join(df.groupby("group")["weights"].sum(), on="group", rsuffix="_2").iloc[:, -1])
df["y"]=df["y"].mul(df["weights"])
res=df.groupby("group", as_index=False).agg({"x": "mean", "y": "sum"})
Outputs:
group x y
0 a 1.5 5.25
1 b 3.5 7.25
Since your weights sum to 1 within groups, you can assign a new column and groupby as usual:
(df.assign(wt_avg_y=df['y']*df['weights'])
.groupby('group')
.agg({'x': 'mean', 'wt_avg_y':'sum', 'weights':'sum'})
.assign(wt_avg_y=lambda x: x['wt_avg_y']/ x['weights'])
)
Output:
x wt_avg_y weights
group
a 1.5 5.25 1.0
b 3.5 7.25 1.0
Steven M. Mortimer's solution is clean and easy to read. Alternatively, one could use dict notation inside pd.Series() such that the index= argument is not needed. This provides slightly better readability in my opinion.
summary = (
df
.groupby(['group'])
.apply(
lambda x: pd.Series({
'avg_x' : np.mean(x['x']),
'wt_avg_y': np.average(x['y'], weights=x['weights'])
}))
.reset_index()
)
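For something closer to the syntax asked for in the question, one option is a small factory that builds the aggregation function. This is only a sketch: weighted_mean is a hypothetical helper, not a pandas function. It looks the weights up by index label in the original frame, so it assumes that frame has a unique index, and it needs a pandas version that supports named aggregation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})

def weighted_mean(frame, weights_col):
    """Hypothetical helper: build an aggregator that weights a group's
    values by frame[weights_col], matched on index labels."""
    def _agg(values):
        return np.average(values, weights=frame.loc[values.index, weights_col])
    return _agg

summary = (
    df
    .groupby('group')
    .agg(avg_x=('x', 'mean'),
         wt_avg_y=('y', weighted_mean(df, 'weights')))
    .reset_index()
)
print(summary)
```

The weights column is now a parameter rather than something buried in a lambda, at the cost of passing the frame twice.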

split a list of dictionaries into multiple columns

I have a dataframe with 30000 rows and 5 columns. One of these columns holds a list of dictionaries, plus a few NaNs. I want to split this column into 3 fields (Legroom through In-flight Entertainment) and extract the ratings.
Below is a sample for reference
d = {'col1': [[{'rating': 5, 'ratingLabel': 'Legroom'}, {'rating': 5, 'ratingLabel': 'Seat comfort'}, {'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],'Nan']}
df = pd.DataFrame(data=d)
df
IIUC This should do the trick:
df=df["col1"].apply(lambda x: pd.Series({el["ratingLabel"]: el["rating"] for el in x if isinstance(x, list)}))
Output:
Legroom Seat comfort In-flight Entertainment
0 5.0 5.0 5.0
1 NaN NaN NaN
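The one-liner above works for the sample because the missing entry is the string 'Nan', which can be iterated. If the real data contains float NaN values, iterating over them would raise a TypeError. A hedged variant is sketched below: ratings_to_dict is a made-up helper name, and it checks the cell type before iterating and builds the result from a list of dicts.

```python
import pandas as pd

df = pd.DataFrame({'col1': [
    [{'rating': 5, 'ratingLabel': 'Legroom'},
     {'rating': 5, 'ratingLabel': 'Seat comfort'},
     {'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
    float('nan'),  # a real missing value instead of the string 'Nan'
]})

def ratings_to_dict(cell):
    """Hypothetical helper: map ratingLabel -> rating, or {} for non-list cells."""
    if not isinstance(cell, list):
        return {}
    return {item['ratingLabel']: item['rating'] for item in cell}

# One dict per row; rows with no ratings become all-NaN.
ratings = pd.DataFrame([ratings_to_dict(cell) for cell in df['col1']],
                       index=df.index)
print(ratings)
```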
Here is a possible solution using Series.apply() with pd.Series, following the strategy from Splitting dictionary/list inside a Pandas Column into Separate Columns:
import pandas as pd
d = {'col1': [[{'rating': 5, 'ratingLabel': 'Legroom'},
{'rating': 5, 'ratingLabel': 'Seat comfort'},
{'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
[{'rating': 5, 'ratingLabel': 'Legroom'},
{'rating': 5, 'ratingLabel': 'Seat comfort'},
{'rating': 5, 'ratingLabel': 'In-flight Entertainment'}],
'Nan']}
df = pd.DataFrame(data=d)
df
df_split = df['col1'].apply(pd.Series)
pd.concat([df,
df_split[0].apply(pd.Series).rename(columns = {'rating':'legroom_rating',
'ratingLabel':'1'}),
df_split[1].apply(pd.Series).rename(columns = {'rating':'seat_comfort_rating',
'ratingLabel':'2'}),
df_split[2].apply(pd.Series).rename(columns = {'rating':'in_flight_entertainment_rating',
'ratingLabel':'3'})],
axis = 1).drop(['col1','1','2','3',0],
axis = 1)
Producing the following DataFrame

Concatenate two 1 column DataFrames doesn't return both columns

I'm using Python 3.6 and I'm a newbie so thanks in advance for your patience.
I have a function that sums the difference between 3 points. It should then take the 'differences' and concatenate them with another DataFrame called labels. k and length are integers. I expected the resulting DataFrame to have two columns but it only has one.
Sample Code:
def distance(df1,df2,labels,k,length):
    total_dist = 0
    for i in range(length):
        dist_dif = df1.iloc[:,i] - df2.iloc[:,i]
        sq_dist = dist_dif ** 2
        root_dist = sq_dist ** 0.5
        total_dist = total_dist + root_dist
    return total_dist
    distance_df = pd.concat([total_dist, labels], axis=1)
    distance_df.sort(ascending=False, axis=1, inplace=True)
    top_knn = distance_df[:k]
    return top_knn.value_counts().index.values[0]
Sample Data:
d1 = {'Z_Norm_Age': [1.20, 2.58,2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24,0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
lbl = {'Survived': [0, 1,1]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data=lbl)
I expected the data to look something like this:
total_dist labels
0 1.715349 0
1 2.872991 1
2 4.344087 1
but instead it looks like this:
0 1.715349
1 4.344087
2 2.872991
dtype: float64
The output doesn't do the following:
1. Return the labels column data
2. Sort the data in descending order
If someone could point me in the right direction, I'd truly appreciate it.
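Given two DataFrames, df1 - df2 performs the subtraction element-wise. Use abs() to take the absolute value of that difference, then sum across each row; that is what the first line of the function below does. The other lines are similar to your code. Note that your function reaches return total_dist before the concat and sort lines, so those lines never run, which is why you get back a single unsorted Series.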
import numpy as np
import pandas as pd
def calc_abs_distance_between_rows_then_add_labels_and_sort(df1, df2, labels):
    diff = np.sum(np.abs(df1-df2), axis=1) # np.sum(..., axis=1) sums the rows
    diff.name = 'total_abs_distance' # Not really necessary, but just to refer to it later
    diff = pd.concat([diff, labels], axis=1)
    diff.sort_values(by='total_abs_distance', axis=0, ascending=True, inplace=True)
    return diff
So for your example data:
d1 = {'Z_Norm_Age': [1.20, 2.58,2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]}
d2 = {'Z_Norm_Age': [-0.51, 0.24,0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]}
lbl = {'Survived': ['a', 'b', 'c']}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
labels = pd.DataFrame(data=lbl)
calc_abs_distance_between_rows_then_add_labels_and_sort(df1, df2, labels)
This hopefully gives what you wanted:
total_abs_distance Survived
0 1.71 a
2 3.87 c
1 4.34 b
A few notes:
Did you really want the L1-norm? If you wanted the L2-norm (Euclidean distance), then replace the first command in that function above by np.sqrt(np.sum(np.square(df1-df2),axis=1)).
What's the purpose of those labels? Consider using the index of the DataFrames instead. Maybe it will fit your purposes better? For example:
# lbl_series = pd.Series(['a','b','c'], name='Survived') # Try this later instead of lbl_list, to further explore the wonders of Pandas indexes :)
lbl_list = ['a', 'b', 'c']
df1.index = lbl_list
df2.index = lbl_list
# Then the L1-norm is simply this:
np.sum(np.abs(df1 - df2), axis=1).sort_values()
# Whose output is the Series: (with the labels as its index)
a 1.71
c 3.87
b 4.34
dtype: float64
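The question's function also tries to pick the most common label among the k nearest rows. A minimal sketch of that last step follows, assuming the Euclidean (L2) variant mentioned in the notes above and the question's original 0/1 labels. The function name knn_vote and its parameters are hypothetical, and ties are broken by whichever label value_counts() happens to list first.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Z_Norm_Age': [1.20, 2.58, 2.54], 'Pclass': [3, 3, 2], 'Conv_Sex': [0, 1, 0]})
df2 = pd.DataFrame({'Z_Norm_Age': [-0.51, 0.24, 0.67], 'Pclass': [3, 1, 3], 'Conv_Sex': [0, 1, 1]})
labels = pd.DataFrame({'Survived': [0, 1, 1]})

def knn_vote(df1, df2, labels, k):
    """Hypothetical helper: most frequent label among the k smallest row distances."""
    label_col = labels.columns[0]
    # Row-wise Euclidean distance between df1 and df2.
    dist = np.sqrt(((df1 - df2) ** 2).sum(axis=1)).rename('total_dist')
    # Put distances and labels side by side, nearest rows first.
    ranked = pd.concat([dist, labels], axis=1).sort_values('total_dist')
    return ranked[label_col].head(k).value_counts().idxmax()

print(knn_vote(df1, df2, labels, k=1))
print(knn_vote(df1, df2, labels, k=3))
```

Sorting ascending and taking the head keeps the nearest rows; the question's descending sort would have kept the farthest ones.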

Using `.at` or `.iat` scalar access methods and boolean indexing on pandas DataFrames

I found out about .at and .iat methods of pandas DataFrames for fast scalar indexing.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#fast-scalar-value-getting-and-setting
Is there a way to combine them with boolean indexing?
In [1]: import pandas as pd
In [2]: data = {
...: "A": [1, 2],
...: "B": [3, 4]
...: }
In [3]: df = pd.DataFrame(data)
In [4]: df.index = ["x", "y"]
In [5]: df
Out[5]:
A B
x 1 3
y 2 4
In [6]: df.ix[df.A == 1, "B"]
Out[6]:
x 3
Name: B, dtype: int64
In [7]: df.ix[df.A == 1, "B"].values[0]
Out[7]: 3
In [8]: df.at["x", "B"]
Out[8]: 3
In [9]: df.at[df.A == 1, "B"]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-e2b7f23503ca> in <module>()
----> 1 df.at[df.A == 1, "B"]
/home/jlcano/.miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1663
1664 key = self._convert_key(key)
-> 1665 return self.obj.get_value(*key, takeable=self._takeable)
1666
1667 def __setitem__(self, key, value):
/home/jlcano/.miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in get_value(self, index, col, takeable)
1898 series = self._get_item_cache(col)
1899 engine = self.index._engine
-> 1900 return engine.get_value(series.get_values(), index)
1901
1902 def set_value(self, index, col, value, takeable=False):
pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3557)()
pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3240)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3986)()
TypeError: 'x True
y False
Name: A, dtype: bool' is an invalid key
This is the easiest solution I have found:
In [10]: df.at[df[df.A == 1].index.tolist()[0], "B"]
Out[10]: 3
IIUC you can do it this way:
In [131]: df
Out[131]:
A B
x 1 3
y 2 4
z 1 5
In [132]: df.at[(df.A == 1).idxmax(), 'B']
Out[132]: 3
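One caveat with the idxmax() approach: when no row matches, the mask is all False and idxmax() still returns the first index label, so .at would silently return an unrelated value. A small guard avoids that. The sketch below is an assumption-laden illustration; first_match_at is a made-up helper, not a pandas API.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 1], "B": [3, 4, 5]}, index=["x", "y", "z"])

def first_match_at(frame, mask, column):
    """Hypothetical helper: scalar at the first True row of mask, or None if no match."""
    if not mask.any():
        return None
    return frame.at[mask.idxmax(), column]

value = first_match_at(df, df["A"] == 1, "B")     # first matching row is "x"
missing = first_match_at(df, df["A"] == 99, "B")  # no row matches

# Without the guard, an all-False mask still yields the first label.
fallback_label = (df["A"] == 99).idxmax()
print(value, missing, fallback_label)
```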
