Getting an error when calculating standard deviation using Pandas - python-3.x

I am trying to calculate the standard deviation of multiple columns using two variables in the groupby. However, my code throws an error and I am having a hard time figuring it out.
I am using https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/ as a guide.
Below is a sample dataframe:
Book Home   Num Pointspread   odds
A    P     -135        -2.5 -110.0
B    P      NaN          -3 -101.0
B    P      NaN          -3 -110.0
C    P      NaN          -3 -120.0
B    P      NaN          -3 -100.0
and this is the code I wrote:
home_std_dev = home_analysis_data.groupby('Book','Home').agg({'Num':'std',
'Pointspread':'std',
'odds':'std'})
The code above gives me an error
ValueError: No axis named Home for object type <class 'type'>
I don't know what this error means or how to solve the issue. I am expecting to see a table with the standard deviation of the columns, grouped by the two variables. Any help will be appreciated.
Since I'm quite new to python, please let me know if there is a better way to approach this issue. Thank you!

Pass a list to groupby, ['Book','Home'], when grouping by multiple columns. With two positional arguments, pandas interprets the second one ('Home') as the axis parameter rather than a second grouping key, which is what raises the ValueError:
home_std_dev = home_analysis_data.groupby(['Book','Home']).agg({'Num':'std',
                                                                'Pointspread':'std',
                                                                'odds':'std'})
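Since all three aggregations are identical, an equivalent and slightly shorter form (a sketch assuming the same frame and column names) is:
# select the three numeric columns after grouping, then make one std() call
# instead of spelling out an agg() dict
home_std_dev = home_analysis_data.groupby(['Book', 'Home'])[['Num', 'Pointspread', 'odds']].std()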

Related

Multilevel index of rows of a dataframe using pandas [duplicate]

I've spent hours browsing everywhere now trying to create a multi-index from a dataframe in pandas. This is the dataframe I have (posting an Excel sheet mockup; I do have this in a pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access it via new_df.index, where new_df is the new dataframe created by either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
To clarify for future users, I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays:
import pandas as pd

lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
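To use it, assign the new index back onto the frame (assuming the row order is unchanged):
currentDataFrame.index = midx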
There are two ways to do it; neither is exactly like what you have shown, but both work.
Say you have the following df:
     A      B  C    D
0  nil    one  1  NaN
1  bar    one  5  5.0
2  foo    two  3  8.0
3  bar  three  2  1.0
4  foo    two  4  2.0
5  bar    two  6  NaN
1. Workaround 1:
df.set_index('A', append=True, drop=False).reorder_levels(order=[1, 0]).sort_index()
This will return:
         A      B  C    D
A
bar 1  bar    one  5  5.0
    3  bar  three  2  1.0
    5  bar    two  6  NaN
foo 2  foo    two  3  8.0
    4  foo    two  4  2.0
nil 0  nil    one  1  NaN
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
           C    D
A   B
bar one    5  5.0
    three  2  1.0
    two    6  NaN
foo two    3  8.0
    two    4  2.0
nil one    1  NaN
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.
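To check which level is which, a quick sketch against the frame from the question:
newmulti = currentDataFrame.set_index(['user_id', 'account_num'])
print(type(newmulti.index))   # pandas.core.indexes.multi.MultiIndex
print(newmulti.index.names)   # ['user_id', 'account_num'], i.e. levels 0 and 1
print(newmulti.index.get_level_values(0))  # the user_id values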

Pandas apply with eval not giving NaN as result when NaN is in the column it's calculating on

I have to support the ability for users to run arbitrary formulas against a frame to produce a new column.
I may have a frame that looks like
  dim01  dim02  msr01
0     A     25    1.0
1     B     26    5.3
2     C     53    NaN
I interpret the user's code so they can run a formula built from supported functions, standard operators, and other columns.
So a formula might look like SQRT([msr01]*100+7).
I convert the user input to Python syntax, so this would evaluate to something like
formula_str = '(math.sqrt((row.msr01*100)+7))'
I then apply it to my pandas dataframe like this
data_frame['msr002'] = data_frame.apply(lambda row: eval(formula_str), axis=1)
This was working well until I hit data with a NaN in a column used in the calculation. I noticed that when this happens I get a frame like this in return:
  dim01  dim02  msr01   msr02
0     A     25    1.0  10.344
1     B     26    5.3  23.173
2     C     53    NaN   7.342
So it appears that the eval is not evaluating the NaN correctly.
I am using a lexer/parser to ensure that the user-submitted formula isn't dangerous, and to convert it from everyday user syntax into Python functions that work against pandas columns.
Any advice on how to fix this?
Perhaps I should include something in the lambda that checks whether any required column is NaN and hardcodes the result to NaN in that case? But that doesn't seem like the best solution to me.
I did see this question, which is similar, but I didn't think it answered my exact need.
You can try a vectorized expression instead, which propagates the NaN naturally:
df.msr01.mul(100).add(7)**0.5
Out[716]:
0    10.34408
1    23.17326
2         NaN
Name: msr01, dtype: float64
Your original code also returns NaN for that row:
df.apply(lambda row: eval(formula_str), axis=1)
Out[714]:
0    10.34408
1    23.17326
2         NaN
dtype: float64

How to add missing data from one dataframe to another?

I am working on a project that requires filling in missing data in one Excel sheet from another. For example:
table A:
card  name   address        zipcode
123   steve  chicago        60601
321   Joy    New York       10083
222   Andy   San Francisco  43211
table B:
card  name   address  zipcode
321   steve  nan      nan
123   Joy    nan      nan
123   nan    nan      nan
For this project, I need to fill in table B according to table A. I know I could use Excel's VLOOKUP function to fill in all of the columns, but if the number of data files grows in the future (e.g., the same data format coming from different branches), I would rather use Python to do this.
In Python, the merge function can do this, but it takes too much time. Is there any useful function in pandas, numpy, or any other third-party library that can help me do this? Thanks all!
Here is what I have tried:
pd.merge(table_A, table_B, on='card', how='right')
It does work, but I have to rename columns so the features match. I also know this can be done very quickly and efficiently in SQL; I just want to do it in Python :)
Of course the pandas library can do this and more. I am currently writing a business intelligence program, and I do a lot of operations like this with pandas.
There are many ways to do this, but since I don't see your code, here is the simplest and most understandable way. Ask again at the point where you get stuck. Thank you.
search_data = Atabledata[['name', 'address', 'zipcode']]
for i in search_data['name']:
    # take the first row in table A matching this name
    match = Atabledata.loc[Atabledata['name'] == i].iloc[0]
    Btabledata.loc[Btabledata['name'] == i, 'address'] = match['address']
    Btabledata.loc[Btabledata['name'] == i, 'zipcode'] = match['zipcode']
print(Btabledata)
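A vectorized alternative that avoids the Python loop (a sketch, assuming each name appears at most once in table A) is to build a lookup indexed by name and map it onto table B:
# name -> (address, zipcode) lookup built from table A
lookup = Atabledata.drop_duplicates('name').set_index('name')

# map the looked-up values onto table B by name; unmatched names stay NaN
Btabledata['address'] = Btabledata['name'].map(lookup['address'])
Btabledata['zipcode'] = Btabledata['name'].map(lookup['zipcode'])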

How to fill missing data in excel time series

I need a hand with this problem: in an Excel workbook I recorded 10 monthly time series for 10 securities, which should cover the past 15 years. Unfortunately, not all the securities cover the full 15-year span. For example, one only goes back to 2003, so in its column the first 5 years contain "Not Available" instead of a value. Once I have imported the data into Matlab, NaN obviously appears in the column of the shorter series wherever values are missing.
>> Prices = xlsread('PrezziTitoli.xls');
>> whos
  Name        Size      Bytes  Class     Attributes
  Prices      182x10     6360  double
My goal is to estimate the variance-covariance matrix, but because of the missing data the calculation is not possible for me. Before computing the variance-covariance matrix, I thought about interpolating to fill in the values that Matlab returns as NaN, for example with fillts, but I am having difficulty using it.
Is there some code that could help me here? Can you help me?
Thanks!
Do you have the statistics toolbox installed? In that case, the solution is simple:
>> x = randn(10,4);          % x is a 10x4 matrix of random numbers
>> x(randi(40,10,1)) = NaN;  % set some random entries to NaN
>> disp(x)
-1.1480 NaN -2.1384 2.9080
0.1049 -0.8880 NaN 0.8252
0.7223 0.1001 1.3546 1.3790
2.5855 -0.5445 NaN -1.0582
-0.6669 NaN NaN NaN
NaN -0.6003 0.1240 -0.2725
-0.0825 0.4900 1.4367 1.0984
-1.9330 0.7394 -1.9609 -0.2779
-0.4390 1.7119 -0.1977 0.7015
-1.7947 -0.1941 -1.2078 -2.0518
>> nancov(x)              % compute covariances after removing all rows containing a NaN
1.2977 0.0520 1.6248 1.3540
0.0520 0.5359 -0.0967 0.3966
1.6248 -0.0967 2.2940 1.6071
1.3540 0.3966 1.6071 1.9358
>> nancov(x, 'pairwise')  % compute covariances pairwise, ignoring NaNs
1.9195 -0.5221 1.4491 -0.0424
-0.5221 0.7325 -0.1240 0.2917
1.4491 -0.1240 2.1454 0.2279
-0.0424 0.2917 0.2279 2.1305
If you don't have the statistics toolbox, we need to think harder - let me know!
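For anyone doing the same computation in Python/pandas rather than Matlab (this thread is tagged python-3.x), no extra toolbox is needed: DataFrame.cov() computes the pairwise covariance of columns while excluding NA/null values. A small sketch with made-up data:
import numpy as np
import pandas as pd

# 10x4 frame of random numbers with a few entries knocked out
x = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
x[x > 1.5] = np.nan

# pairwise covariance, NaNs excluded per pair of columns
# (comparable to nancov(x, 'pairwise') in Matlab)
print(x.cov())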
