Trying to divide two columns of a dataframe but get Nan - python-3.x

Background:
I deal with a dataframe and want to divide the two columns of this dataframe to get a new column. The code is shown below:
import pandas as pd
df = {'drive_mile': [15.1, 2.1, 7.12], 'price': [40, 9, 31]}
df = pd.DataFrame(df)
df['price/km'] = df[['drive_mile', 'price']].apply(lambda x: x[1]/x[0])
print(df)
And I get the below result:
drive_mile price price/km
0 15.10 40 NaN
1 2.10 9 NaN
2 7.12 31 NaN
Why would this happen? And how can I fix it?

As pointed out in the comments, you missed the axis=1 parameter to perform the division on the right dimension using apply. This is because you end up with different indices when joining back in the DataFrame.
However, more importantly, do not use apply to perform a division!. Apply is often much less efficient compared to vectorial operations.
Use div:
df['price/km'] = df['drive_mile'].div(df['price'])
Or /:
df['price/km'] = df['drive_mile']/df['price']

Related

Add Column For Results Of Dataframe Resample [duplicate]

I have the following data frame in IPython, where each row is a single stock:
In [261]: bdata
Out[261]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21210 entries, 0 to 21209
Data columns:
BloombergTicker 21206 non-null values
Company 21210 non-null values
Country 21210 non-null values
MarketCap 21210 non-null values
PriceReturn 21210 non-null values
SEDOL 21210 non-null values
yearmonth 21210 non-null values
dtypes: float64(2), int64(1), object(4)
I want to apply a groupby operation that computes cap-weighted average return across everything, per each date in the "yearmonth" column.
This works as expected:
In [262]: bdata.groupby("yearmonth").apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
Out[262]:
yearmonth
201204 -0.109444
201205 -0.290546
But then I want to sort of "broadcast" these values back to the indices in the original data frame, and save them as constant columns where the dates match.
In [263]: dateGrps = bdata.groupby("yearmonth")
In [264]: dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/mnt/bos-devrnd04/usr6/home/espears/ws/Research/Projects/python-util/src/util/<ipython-input-264-4a68c8782426> in <module>()
----> 1 dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
TypeError: 'DataFrameGroupBy' object does not support item assignment
I realize this naive assignment should not work. But what is the "right" Pandas idiom for assigning the result of a groupby operation into a new column on the parent dataframe?
In the end, I want a column called "MarketReturn" than will be a repeated constant value for all indices that have matching date with the output of the groupby operation.
One hack to achieve this would be the following:
marketRetsByDate = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
bdata["MarketReturn"] = np.repeat(np.NaN, len(bdata))
for elem in marketRetsByDate.index.values:
bdata["MarketReturn"][bdata["yearmonth"]==elem] = marketRetsByDate.ix[elem]
But this is slow, bad, and unPythonic.
In [97]: df = pandas.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})
In [98]: df.join(df.groupby('month')['A'].sum(), on='month', rsuffix='_r')
Out[98]:
A B month A_r
0 -0.040710 0.182269 0 -0.331816
1 -0.004867 0.642243 1 2.448232
2 -0.162191 0.442338 4 2.045909
3 -0.979875 1.367018 5 -2.736399
4 -1.126198 0.338946 5 -2.736399
5 -0.992209 -1.343258 1 2.448232
6 -1.450310 0.021290 0 -0.331816
7 -0.675345 -1.359915 9 2.722156
While I'm still exploring all of the incredibly smart ways that apply concatenates the pieces it's given, here's another way to add a new column in the parent after a groupby operation.
In [236]: df
Out[236]:
yearmonth return
0 201202 0.922132
1 201202 0.220270
2 201202 0.228856
3 201203 0.277170
4 201203 0.747347
In [237]: def add_mkt_return(grp):
.....: grp['mkt_return'] = grp['return'].sum()
.....: return grp
.....:
In [238]: df.groupby('yearmonth').apply(add_mkt_return)
Out[238]:
yearmonth return mkt_return
0 201202 0.922132 1.371258
1 201202 0.220270 1.371258
2 201202 0.228856 1.371258
3 201203 0.277170 1.024516
4 201203 0.747347 1.024516
As a general rule when using groupby(), if you use the .transform() function pandas will return a table with the same length as your original. When you use other functions like .sum() or .first() then pandas will return a table where each row is a group.
I'm not sure how this works with apply but implementing elaborate lambda functions with transform can be fairly tricky so the strategy that I find most helpful is to create the variables I need, place them in the original dataset and then do my operations there.
If I understand what you're trying to do correctly first you can calculate the total market cap for each group:
bdata['group_MarketCap'] = bdata.groupby('yearmonth')['MarketCap'].transform('sum')
This will add a column called "group_MarketCap" to your original data which would contain the sum of market caps for each group. Then you can calculate the weighted values directly:
bdata['weighted_P'] = bdata['PriceReturn'] * (bdata['MarketCap']/bdata['group_MarketCap'])
And finally you would calculate the weighted average for each group using the same transform function:
bdata['MarketReturn'] = bdata.groupby('yearmonth')['weighted_P'].transform('sum')
I tend to build my variables this way. Sometimes you can pull off putting it all in a single command but that doesn't always work with groupby() because most of the time pandas needs to instantiate the new object to operate on it at the full dataset scale (i.e. you can't add two columns together if one doesn't exist yet).
Hope this helps :)
May I suggest the transform method (instead of aggregate)? If you use it in your original example it should do what you want (the broadcasting).
I did not find a way to make assignment to the original dataframe. So I just store the results from the groups and concatenate them. Then we sort the concatenated dataframe by index to get the original order as the input dataframe. Here is a sample code:
In [10]: df = pd.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})
In [11]: df.head()
Out[11]:
month A B
0 4 -0.029106 -0.904648
1 2 -2.724073 0.492751
2 7 0.732403 0.689530
3 2 0.487685 -1.017337
4 1 1.160858 -0.025232
In [12]: res = []
In [13]: for month, group in df.groupby('month'):
...: new_df = pd.DataFrame({
...: 'A^2+B': group.A ** 2 + group.B,
...: 'A+B^2': group.A + group.B**2
...: })
...: res.append(new_df)
...:
In [14]: res = pd.concat(res).sort_index()
In [15]: res.head()
Out[15]:
A^2+B A+B^2
0 -0.903801 0.789282
1 7.913327 -2.481270
2 1.225944 1.207855
3 -0.779501 1.522660
4 1.322360 1.161495
This method is pretty fast and extensible. You can derive any feature here.
Note: If the dataframe is too large, concat may cause you MMO error.

pandas df are being read as dict

I'm having some trouble with pandas. I opened a .xlsx file with pandas, but when I try to filter any information, it shows me the error
AttributeError: 'dict' object has no attribute 'head' #(or iloc, or loc, or anything else from DF/pandas)#
So, I did some research and realized that my table turned into a dictionary (why?).
I'm trying to convert this mess into a proper dictionary, so I can convert it into a properly df, because right now, it shows some characteristics from both. I need a df, just it.
Here is the code:
import pandas as pd
df = pd.read_excel('report.xlsx', sheet_name = ["May"])
print(df)
Result: it shows the table plus "[60 rows x 24 columns]"
But when I try to filter or iterate, it shows all dicts possible attibute errors.
Somethings I tried: .from_dict, xls.parse/(df.to_dict).
When I try to convert df to dict properly, it shows
ValueError: If using all scalar values, you must pass an index
I tried this link: [https://stackoverflow.com/questions/17839973/constructing-pandas-dataframe-from-values-in-variables-gives-valueerror-if-usi)][1], but it didn't work. For some reason, it said in one of the errors that I should provide 2-d parameters, that's why I tried to create a new dict and do a sort of 'append', but it didn't work too...
Then I tried all stuff to set an index, but it doesn't let me rename columns because it says .iloc is not an attribute from dict)
I'm new in python, but I never saw a 'pd.read_excel' open a DataFrame as 'dict'. What should I do?
tks!
[1]: Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"
if its a dict of DataFrames try...
>>> dict_df = {"a":pd.DataFrame([{1:2,3:4},{1:4,4:6}]), "b":pd.DataFrame([{7:9},{1:4}])}
>>> dict_df
{'a': 1 3 4
0 2 4.0 NaN
1 4 NaN 6.0, 'b': 7 1
0 9.0 NaN
1 NaN 4.0}
>>> pd.concat(dict_df.values(),keys=dict_df.keys(), axis=1)
a b
1 3 4 7 1
0 2 4.0 NaN 9.0 NaN
1 4 NaN 6.0 NaN 4.0

Dask apply with custom function

I am experimenting with Dask, but I encountered a problem while using apply after grouping.
I have a Dask DataFrame with a large number of rows. Let's consider for example the following
N=10000
df = pd.DataFrame({'col_1':np.random.random(N), 'col_2': np.random.random(N) })
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1 and I follow the solution from here
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df,col,bins,labels):
return df.assign(bin_num = pd.cut(df[col],bins,labels=labels))
and this works as I expect it to.
Now I want to take the median value in each bin (taken from here)
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions so I guess that somehow the apply is working on each one individually.
However, If I want the mean and use mean
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
it works and the output has 10 rows.
The question is then: what am I doing wrong that is preventing apply from operating as mean?
Maybe this warning is the key (Dask doc: SeriesGroupBy.apply) :
Pandas’ groupby-apply can be used to to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
You are right! I was able to reproduce your problem on Dask 2.11.0. The good news is that there's a solution! It appears that the Dask groupby problem is specifically with the category type (pandas.core.dtypes.dtypes.CategoricalDtype). By casting the category column to another column type (float, int, str), then the groupby will work correctly.
Here's your code that I copied:
import dask.dataframe as dd
import pandas as pd
import numpy as np
def test_f(df, col, bins, labels):
return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)
print(ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which prints out the problem you mentioned
bin_num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
5 0.550844
6 0.651036
7 0.751220
8 NaN
9 NaN
Name: col_1, Length: 80, dtype: float64
Here's my solution:
ddf3 = ddf2.copy()
ddf3["bin_num"] = ddf3["bin_num"].astype("int")
print(ddf3.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which printed:
bin_num
9 0.951369
2 0.249150
1 0.149563
0 0.049897
3 0.347906
8 0.847819
4 0.449029
5 0.550608
6 0.652778
7 0.749922
Name: col_1, dtype: float64
#MRocklin or #TomAugspurger
Would you be able to create a fix for this in a new release? I think there is sufficient reproducible code here. Thanks for all your hard work. I love Dask and use it every day ;)

Contructing new dataframe and keeping old one?

I am sorry for asking a naive question but it's driving me crazy at the moment. I have a dataframe df1, and creating new dataframe df2 by using it, as following:
import pandas as pd
def NewDF(df):
df['sum']=df['a']+df['b']
return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(df1)
df2 =NewDF(df1)
print(df1)
which gives
a b
0 1 4
1 2 5
2 3 6
a b sum
0 1 4 5
1 2 5 7
2 3 6 9
Why I am loosing df1 shape and getting third column? How can I avoid this?
DataFrames are mutable so you should either explicitly pass a copy to your function, or have the first step in your function copy the input. Otherwise, just like with lists, any modifications your functions make also apply to the original.
Your options are:
def NewDF(df):
df = df.copy()
df['sum']=df['a']+df['b']
return df
df2 = NewDF(df1)
or
df2 = NewDF(df1.copy())
Here we can see that everything in your original implementation refers to the same object
import pandas as pd
def NewDF(df):
print(id(df))
df['sum']=df['a']+df['b']
print(id(df))
return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(id(df1))
#2242099787480
df2 = NewDF(df1)
#2242099787480
#2242099787480
print(id(df2))
#2242099787480
The third column that you are getting is the Index column, each pandas DataFrame will always maintain an Index, however you can choose if you don't want it in your output.
import pandas as pd
def NewDF(df):
df['sum']=df['a']+df['b']
return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(df1.to_string(index=False))
df2 =NewDF(df1)
print(df1.to_string(index = False))
Gives the output as
a b
1 4
2 5
3 6
a b sum
1 4 5
2 5 7
3 6 9
Now you might have the question why does index exist, Index is actually a backed hash table which increases speed and is a highly desirable feature in multiple contexts, If this was just a one of question, this should be enough, but if you are looking to learn more about pandas and I would advice you to look into indexing, you can begin by looking here https://stackoverflow.com/a/27238758/10953776

Re-indexing a pandas Series extracted from DataFrame

Problematic
Let us say that I have a DataFrame df of n rows like this:
| Precipitation | Discharge |
|:------12-------:|:-----16-----:|
|:------10-------:|:-----15-----:|
|:------12-------:|:-----16-----:|
|:------10-------:|:-----15-----:|
...
|:------12-------:|:-----16-----:|
|:------10-------:|:-----15-----:|
Rows are automatically indexed from 1 to n. When we extract one column for example:
series = df.loc[5:,['Precipitation']]
or
series = df.Precipitation[5:]
The extracted series tends to be like:
Precipitation
5 16
6 17
7 18
...
n 15
So the question is how can we modify the generic indexing from 5 to n to 0 to n-5.
Note that I have tried series.reindex() and series.reset_index() but neither of them works...
Currently I do series.tolist() to solve the problem but is there any way more elegant and smarter?
Many thanks in advance !
If I understand your question correctly, you want to select the first n-5 rows of your column.
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7], 'col2': [3,4,3,2,5,9,7]})
df.loc[:len(df)-5,['col1']]
You may want to read up on slicing with pandas: https://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges
EDIT:
I think I misunderstood, and what you instead want is to 'reset' your index. You can do that with the method reset_index()
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7], 'col2': [3,4,3,2,5,9,7]})
df = df.loc[5:,['col1']]
df.reset_index(drop=True)

Resources