Display sparse multi-index of dataframe in Pandas - python-3.x

Is this a bug in how Pandas displays a sparse multi-index?
(Running with IPython 6.2.1, Python 3.6.4)
In [1]:
import pandas as pd
from io import StringIO
data = """
code\tname\ttyp\tntf\n
A5411\tWD\tAF\t\n
A5411\tWD\tAF\t210194618\n
B5498\tSH\tNC\t\n
B5498\tSH\tNC\t210213014\n
"""
df = pd.read_table(StringIO(data))
In [2]: df.set_index(['name','code'])
Out[2]:
            typ          ntf
name code
WD   A5411   AF          NaN
     A5411   AF  210194618.0
SH   B5498   NC          NaN
     B5498   NC  210213014.0
I expected Out[2] to look something like Out[3], with the repeated code values left blank:
In [3]: df.set_index(['name', 'code', 'typ'])
Out[3]:
                        ntf
name code  typ
WD   A5411 AF           NaN
           AF   210194618.0
SH   B5498 NC           NaN
           NC   210213014.0
Any idea on this?
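As far as I can tell this is expected behaviour rather than a bug: pandas blanks repeated labels only in the outer index levels and always prints the innermost level in full. In Out[2] the innermost level is code, so it is repeated; in Out[3] the innermost level is typ, so code gets blanked. A minimal sketch to check this on the data from the question (the variable names two_levels and three_levels are just for illustration):

```python
from io import StringIO

import pandas as pd

# Same rows as in the question, written as one tab-separated string.
data = (
    "code\tname\ttyp\tntf\n"
    "A5411\tWD\tAF\t\n"
    "A5411\tWD\tAF\t210194618\n"
    "B5498\tSH\tNC\t\n"
    "B5498\tSH\tNC\t210213014\n"
)
df = pd.read_csv(StringIO(data), sep="\t")

two_levels = df.set_index(["name", "code"])
three_levels = df.set_index(["name", "code", "typ"])

# 'code' is the innermost level here, so every label is printed.
print(two_levels)

# 'code' is now an outer level, so the repeated label is blanked.
print(three_levels)

# Turning sparsification off prints every label in every level.
with pd.option_context("display.multi_sparse", False):
    print(two_levels)
```

If the blanked look is wanted for code as well, one option is to keep a further column such as typ in the index, as in Out[3].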

Related

How to get residuals from statsmodels AutoRegResults? model.resid returns all NaN

I am trying to plot the residuals from statsmodels' AutoRegResults, but results.resid only contains NaN when I access it. However, when I call plot_diagnostics(), it plots the standardized residuals with no issues. How can I get the actual residuals?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
import warnings
df=pd.read_csv('Bank_of_England_Database.csv',
sep=',',
parse_dates=["Date"],
dayfirst=True,
index_col="Date")
df.rename({list(df.columns)[-1] : 'Spot Exchange Rate'},
axis='columns',
inplace=True)
df['RW 11'] = df.rolling(window=11, min_periods=11, center=True).mean()
xbar = df['Spot Exchange Rate'].mean()
df['demean'] = df['Spot Exchange Rate'] - xbar
fig = plt.figure()
fig.suptitle("AR(p) Residuals")
lags = [1] #, 2, 3]
for lag in lags:
    warnings.filterwarnings("ignore")  # Stops a FutureWarning and ValueWarning about dates
    model = AutoReg(df['demean'], lags=lag)
    results = model.fit()
    resid = results.resid  # Returns NaN
    print(resid.head())
    plt.plot(df.index[lag:], resid, label=f"lag={lag}")
results.plot_diagnostics()
plt.show()
Date
2015-05-01 NaN
2015-05-05 NaN
2015-05-06 NaN
2015-05-07 NaN
2015-05-08 NaN
dtype: float64
No handles with labels found to put in legend.
[Figure: my residual plot, which is just NaN]
[Figure: output of plot_diagnostics]
EDIT
Updated code showing version with the same issue:
import pandas as pd
import numpy as np
from statsmodels.tsa.api import AutoReg
import statsmodels as sm
import matplotlib.pyplot as plt
print(f"statsmodel version: {sm.__version__}")
df=pd.read_csv('Bank_of_England_Database.csv',
sep=',',
parse_dates=["Date"],
dayfirst=True,
index_col="Date")
df.rename({list(df.columns)[-1] : 'Spot Exchange Rate'},
axis='columns',
inplace=True)
df['demean'] = df['Spot Exchange Rate'] - df['Spot Exchange Rate'].mean()
res = AutoReg(df['demean'], lags=2).fit()
res.plot_diagnostics()
print(f"All NaN: {np.isnan(res.resid).all()}")
plt.show()
Results:
statsmodel version: 0.12.2
/usr/local/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:581: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
warnings.warn('A date index has been provided, but it has no'
/usr/local/lib/python3.9/site-packages/statsmodels/tsa/ar_model.py:248: FutureWarning: The parameter names will change after 0.12 is released. Set old_names to False to use the new names now. Set old_names to True to use the old names.
warnings.warn(
All NaN: True
It doesn't seem to be possible to reproduce this issue using statsmodels 0.12.2.
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.api import AutoReg
import numpy as np
import pandas as pd
y = ArmaProcess.from_coeffs([1.8,-0.9]).generate_sample(250)
res = AutoReg(y,lags=2,old_names=False).fit()
print(f"All finite (no nan/inf): {np.all(np.isfinite(res.resid))}")
# Try a Series
ys = pd.Series(y, index=pd.date_range("1900-1-1",periods=250,freq="M"))
res = AutoReg(ys,lags=2,old_names=False).fit()
print(f"All finite using Series (no nan/inf): {np.all(np.isfinite(res.resid))}")
# Try a Series with no freq
idx = pd.date_range("1900-1-1",periods=750,freq="M")
uneven_idx = sorted(np.random.default_rng().choice(idx, size=250, replace=False))
ys = pd.Series(y, index=uneven_idx)
res = AutoReg(y,lags=2,old_names=False).fit()
print(f"All finite using Series w/o freq (no nan/inf): {np.all(np.isfinite(res.resid))}")
This produces
All finite (no nan/inf): True
All finite using Series (no nan/inf): True
All finite using Series w/o freq (no nan/inf): True
Upgrading to the latest release might be needed.

Create a flask app to return list of dictionary from xlsx file in python3

I have a dataframe as shown below, which is in xlsx format
Date        in_days  t_factor
2020-02-01        1         5
2020-02-06        6        14
2020-02-09        9        14
2020-02-03        3        11
2020-02-11       11        14
I would like to create a Flask app that reads this data from its location and converts it to a list of dictionaries.
I am new to Flask.
I tried the code below in a Jupyter notebook to return the list of dictionaries:
import datetime as dt
import pandas as pd
df = pd.read_excel('data.xlsx')
def get_function_model_data(caliberation_df):
    df = caliberation_df.copy()
    df.fillna(0, inplace=True)
    df['Date'] = df['Date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d'))
    df.rename(columns={'in_days': 'InDays'}, inplace=True)
    df = df[['Date', 'InDays']]
    return list(df.T.to_dict().values())
I would like the Flask app to read the Excel file from its location and return the list of dictionaries for the selected columns.
I tried the code below:
import pandas as pd
import numpy as np
import datetime as dt
from flask import Flask
app = Flask(__name__)
def get_function_model_data():
    df = pd.read_excel('data.xlsx')
    df.fillna(0, inplace=True)
    df['Date'] = df['Date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d'))
    df.rename(columns={'in_days': 'InDays'}, inplace=True)
    df = df[['Date', 'InDays']]
    return list(df.T.to_dict().values())
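The attempt above defines the conversion function but never registers a route, so Flask has nothing to serve. A minimal sketch of one way to wire it up, under a few assumptions: the endpoint name /model-data, the helper frame_to_records and the load_frame hook are all hypothetical names chosen for illustration, and the file is assumed to be data.xlsx in the working directory.

```python
import json

import pandas as pd
from flask import Flask, jsonify


def frame_to_records(df):
    """Return the Date and InDays columns as a list of dictionaries."""
    out = df.rename(columns={"in_days": "InDays"})[["Date", "InDays"]].copy()
    out["InDays"] = out["InDays"].fillna(0)
    # Format timestamps as plain strings so they serialise cleanly.
    out["Date"] = pd.to_datetime(out["Date"]).dt.strftime("%Y-%m-%d")
    # Round-trip through JSON so the values are plain Python types.
    return json.loads(out.to_json(orient="records"))


def create_app(load_frame=None):
    # load_frame is a hypothetical hook so the data source can be swapped,
    # for example in tests; by default the Excel file is read on each request.
    if load_frame is None:
        load_frame = lambda: pd.read_excel("data.xlsx")

    app = Flask(__name__)

    @app.route("/model-data")  # hypothetical endpoint name
    def model_data():
        return jsonify(frame_to_records(load_frame()))

    return app


if __name__ == "__main__":
    create_app().run(debug=True)
```

With the development server running, a GET request to /model-data should return the records as a JSON array.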

Python - dataframe url parsing issue

I am trying to extract domain names from the URLs in one column into another column. It works on a single string, but it doesn't work when I apply it to a dataframe. How do I apply this to a dataframe?
Tried:
from urllib.parse import urlparse
import numpy as np
import pandas as pd
id1 = [1,2,3]
ls = ['https://google.com/tensoflow','https://math.com/some/website',np.NaN]
df = pd.DataFrame({'id':id1,'url':ls})
df
# urlparse(df['url']) # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# df['url'].map(urlparse) # AttributeError: 'float' object has no attribute 'decode'
Working on a string:
string = 'https://google.com/tensoflow'
parsed_uri = urlparse(string)
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
result
Looking for a column like this:
col3
https://google.com/
https://math.com/
nan
You can try something like this.
Here I have used pandas.Series.apply() and skipped the missing values with pd.isna().
» Initialization and imports
>>> from urllib.parse import urlparse
>>> import pandas as pd
>>> id1 = [1,2,3]
>>> import numpy as np
>>> ls = ['https://google.com/tensoflow','https://math.com/some/website',np.NaN]
>>> ls
['https://google.com/tensoflow', 'https://math.com/some/website', nan]
>>>
» Inspect the newly created DataFrame.
>>> df = pd.DataFrame({'id':id1,'url':ls})
>>> df
id url
0 1 https://google.com/tensoflow
1 2 https://math.com/some/website
2 3 NaN
>>>
>>> df["url"]
0 https://google.com/tensoflow
1 https://math.com/some/website
2 NaN
Name: url, dtype: object
>>>
» Apply a function to the url column using pandas.Series.apply(func).
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else np.nan)
0 https://google.com/
1 https://math.com/
2 NaN
Name: url, dtype: object
>>>
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
>>>
» Store the above result in a variable (optional, just to simplify).
>>> s = df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
>>> s
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
» Finally
>>> df2 = pd.DataFrame({"col3": s})
>>> df2
col3
0 https://google.com/
1 https://math.com/
2 nan
>>>
» To confirm what s and df2 are, check their types (again, optional).
>>> type(s)
<class 'pandas.core.series.Series'>
>>>
>>>
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
>>>
Reference links:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.isnull.html
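Putting the steps above together, here is a compact sketch that writes the result straight into a new column of the original dataframe. base_url is a hypothetical helper name chosen for illustration, not a pandas or urllib function.

```python
from urllib.parse import urlparse

import numpy as np
import pandas as pd


def base_url(url):
    """Return 'scheme://netloc/' for a URL, or NaN for a missing value."""
    if pd.isna(url):
        return np.nan
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"


df = pd.DataFrame({
    "id": [1, 2, 3],
    "url": ["https://google.com/tensoflow", "https://math.com/some/website", np.nan],
})

# apply() calls base_url once per element, so urlparse only ever sees a string.
df["col3"] = df["url"].apply(base_url)
print(df)
```

Checking for the missing value first avoids the AttributeError from the question, which comes from passing a float NaN to urlparse.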

Python 3: Import files of floats and give them unique identifiers for plotting

I have two files named "Posterior_C.txt" and "Posterior_l.txt", each containing 5000 float entries, that I would like to import and concatenate into a dataframe (for plotting in seaborn). Each entry from Posterior_C should be given the label C and each entry from Posterior_l the label l.
How can I import the data and concatenate it while creating a unique identifier for each source? E.g.
0.012 Posterior_C
0.0021 Posterior_C
0.2 Posterior_l
0.52 Posterior_l
This is what I've got so far:
import pandas as pd
import numpy as np
C=np.loadtxt("Posterior_C.txt")
l=np.loadtxt("Posterior_l.txt")
df={C,l}
df=pd.DataFrame(df)
import numpy as np
xc = np.array(["C"])
c=np.repeat(xc, 5000, axis=0)
import numpy as np
xl = np.array(["l"])
l=np.repeat(xl, 5000, axis=0)
But a bit stuck now.
In R I would do:
C <- read.table("Posterior_C.txt", header=FALSE)
l <- read.table("Posterior_l.txt", header=FALSE)
df <- rbind(C, l)
df <- as.data.frame(df)
df$ID <- c(rep("C", NROW(C)), rep("l", NROW(l)))
or something similar.
Something like this:
c = pd.read_table("Posterior_C.txt", header=None)
l = pd.read_table("Posterior_l.txt", header=None)
c['ID'] = 'C'
l['ID'] = 'l'
df = pd.concat([c, l], ignore_index=True)
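The same idea as a self-contained sketch that runs without the real files. The StringIO objects stand in for Posterior_C.txt and Posterior_l.txt, and the column name value and the helper load_labelled are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd


def load_labelled(source, label):
    """Read one float per line and tag every row with the given label."""
    frame = pd.read_csv(source, header=None, names=["value"])
    frame["ID"] = label
    return frame


# Stand-ins for the two files; pass the real file paths instead.
c_file = StringIO("0.012\n0.0021\n")
l_file = StringIO("0.2\n0.52\n")

df = pd.concat(
    [load_labelled(c_file, "C"), load_labelled(l_file, "l")],
    ignore_index=True,
)
print(df)
```

The long format with one value column and one ID column is the layout seaborn usually expects when a column is mapped to hue or to a category axis.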

Reading a pickled file, unequal length error in pandas dataframe

I want to read a pickle file in Python 3.5. I am using the following code.
The following is my output, which I want to load as a pandas dataframe.
When I try to convert it into a pandas DataFrame using df = pd.DataFrame(df), I get the error below.
ValueError: arrays must all be same length
Link to data: https://drive.google.com/file/d/1lSFBPLbUCluWfPjzolUZKmD98yelTSXt/view?usp=sharing
I think you need a dict comprehension with concat:
import pickle
import pandas as pd
from pandas.io.json import json_normalize
fh = open("imdbnames40.pkl", 'rb')
d = pickle.load(fh)
df = pd.concat({k:json_normalize(v, 'scores', ['best']) for k,v in d.items()})
print (df.head())
                         ethnicity  score              best
'Aina Rapoza 0               Asian   0.89             Asian
             1      GreaterAfrican   0.05             Asian
             2     GreaterEuropean   0.06             Asian
             3  IndianSubContinent   0.11  GreaterEastAsian
             4    GreaterEastAsian   0.89  GreaterEastAsian
Then, if you need a column from the first level of the MultiIndex:
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
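To see the two steps end to end without the pickle file, here is a small sketch on made-up data. The dictionary d below is an assumption about the shape of the pickled object (a name mapped to a list of records, each with a 'best' label and a list of 'scores'), and "Some Name" is a hypothetical entry. It uses the top-level pd.json_normalize, which is where newer pandas versions expose the function.

```python
import pandas as pd

# Hypothetical stand-in for the object loaded from the pickle file.
d = {
    "Aina Rapoza": [
        {
            "best": "Asian",
            "scores": [
                {"ethnicity": "Asian", "score": 0.89},
                {"ethnicity": "GreaterAfrican", "score": 0.05},
            ],
        },
    ],
    "Some Name": [
        {
            "best": "GreaterEuropean",
            "scores": [{"ethnicity": "GreaterEuropean", "score": 0.7}],
        },
    ],
}

# One flat frame per name; concat on a dict puts the keys in index level 0.
df = pd.concat(
    {
        name: pd.json_normalize(records, record_path="scores", meta=["best"])
        for name, records in d.items()
    }
)

# Drop the per-frame row counter and turn the names into a regular column.
flat = df.reset_index(level=1, drop=True).rename_axis("names").reset_index()
print(flat)
```

Each name can have a different number of score rows, which is why building one frame per key and concatenating avoids the "arrays must all be same length" error.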

Resources